AD-A168  738 


AFOSR-TR-  86-  0  279 


COMPUTER  SCIENCE 
TECHNICAL  REPORT  SERIES 


Approved for public release;
distribution unlimited.




UNIVERSITY  OF  MARYLAND 

COLLEGE  PARK,  MARYLAND 
20742 

DTIC 


May  1985 




Technical Report-1500

Evaluations of Software Technologies:
Testing, CLEANROOM, and Metrics


Richard W. Selby, Jr.
Department of Computer Science
University of Maryland
College Park, Maryland




Dissertation submitted to the Faculty of the Graduate School
of the University of Maryland in partial fulfillment of
the requirements for the degree of
Doctor of Philosophy


ABSTRACT 


Title  of  Dissertation:  Evaluations  of  Software  Technologies:  Testing, 

CLEANROOM,  and  Metrics 


Richard  Wayne  Selby,  Jr.,  Doctor  of  Philosophy,  1985 


Dissertation directed by: Victor R. Basili

Professor  and  Chairman 
Department  of  Computer  Science 


The evaluation of software technologies suffers because of the lack of
quantitative assessment of their effect on software development and
modification. A seven-step approach for quantitatively evaluating software
technologies couples software methodology evaluation with software
measurement. The approach is applied in-depth in the following three areas.
1) Software Testing Strategies: A 74-subject study, including 32 professional
programmers and 42 advanced university students, compared code reading,
functional testing, and structural testing in a fractional factorial design.
2) Cleanroom Software Development: Fifteen three-person teams separately
built a 1200-line message system to compare Cleanroom software development
(in which software is developed completely off-line) with a more traditional
approach. 3) Characteristic Software Metric Sets: In the NASA S.E.L.
production environment, a study of 65 candidate product and process measures
of 652 modules from six (51,000 - 112,000 line) projects yielded a
characteristic set of software cost/quality metrics.

The major results are the following. 1) The approach described for
quantitatively evaluating software technologies has been demonstrated and
shown to be effective in a variety of problem domains. 2) With the
professionals, code reading detected more software faults and had a higher
fault detection rate than did functional or structural testing, while
functional testing detected more faults than did structural testing, but
functional and structural testing were not different in fault detection rate.
3) With the students, the three techniques were not different in the number
of faults detected or in the fault detection rate, except that structural
testing detected fewer faults than did the others in one study phase. 4) Code
reading detected more interface faults and functional testing detected more
control faults than did the other methods. 5) Most developers using the
Cleanroom software development approach were able to build systems completely
off-line. 6) The Cleanroom teams' products met system requirements more
completely and succeeded on more operational test cases than did those
developed with a traditional approach. 7) An approach described for
calculating a characteristic metric set yielded the set for the NASA S.E.L.
environment {source lines, design effort, number of input/output parameters,
fault correction effort per executable statement, code effort, number of
versions}.


ACKNOWLEDGEMENT 


I greatly appreciate the opportunity to have worked with people who are
shaping the frontier of the software engineering field. I wish to thank Larry
S. Davis, John D. Gannon, Harlan D. Mills, and Kent L. Norman for serving on
my committee and providing several insightful comments and suggestions. The
ideas of code reading by stepwise abstraction and Cleanroom software
development are those of Harlan D. Mills, to whom I am grateful. I wish to
thank F. Terry Baker for his essential role in the collection of a major
portion of the data presented. The assistance provided by Frank E. McGarry
and Jerry Page was crucial to the success of my analysis in the NASA Software
Engineering Laboratory. I want to thank John D. Gannon for his refreshing
attitude and valuable support. The members of our research group and several
fellow graduate students have offered encouragement and helpful criticisms on
this work. I also wish to thank Michael E. Fagan, David H. Hutchens, and
Marvin V. Zelkowitz for several enlightening discussions. A special
appreciation goes to my advisor, Victor R. Basili, whose motivation,
leadership, and spirit made this work possible.

This research was supported in part by the Air Force Office of Scientific
Research Contract AFOSR-F49620-80-C-001 and the National Aeronautics and
Space Administration Grant NSG-5123 to the University of Maryland. Computer
support was provided in part by the facilities of NASA Goddard Space Flight
Center and the Computer Science Center at the University of Maryland.


Table  of  Contents 


1 Introduction
2 A Quantitative Approach for Evaluating Software Technologies
2.1 Methodology for Data Collection and Analysis
2.2 Coupling Goals With Analysis Methods
2.2.1 Forms of Result Statements
2.3 Analysis Classification Scheme
2.4 Classification of Analyses of Software
2.4.1 Blocked Subject-Project Studies
2.4.2 Replicated Project Studies
2.4.3 Multi-Project Variation Studies
2.4.4 Single Project Studies
3 Evaluation of Software Technologies: Problem Selection
3.1 Selection Criteria
3.2 Analysis Selection
3.2.1 Software Testing Strategy Comparison
3.2.2 Cleanroom Development Approach Analysis
3.2.3 Characteristic Metric Set Study
3.3 Methodology Application
4 Evaluation of Software Technologies: Analysis and Results
4.1 Software Testing Strategy Comparison
4.1.1 Testing Techniques
4.1.1.1 Investigation Goals
4.1.2 Empirical Study
4.1.2.1 Iterative Experimentation
4.1.2.2 Subject and Program/Fault Selection
4.1.2.2.1 Subjects
4.1.2.2.2 Programs
4.1.2.2.3 Faults
4.1.2.2.3.1 Fault Origin
4.1.2.2.3.2 Fault Classification
4.1.2.2.3.3 Fault Description
4.1.2.3 Experimental Design
4.1.2.3.1 Independent and Dependent Variables
4.1.2.3.2 Analysis of Variance Model
4.1.2.4 Experimental Operation
4.1.3 Data Analysis
4.1.3.1 Fault Detection Effectiveness
4.1.3.1.1 Data Distributions
4.1.3.1.2 Number of Faults Detected
4.1.3.1.3 Percentage of Faults Detected
4.1.3.1.4 Dependence on Software Type
4.1.3.1.5 Observable vs. Observed Faults
4.1.3.1.6 Dependence on Program Coverage
4.1.3.1.7 Dependence on Programmer Expertise
4.1.3.1.8 Accuracy of Self-Estimates
4.1.3.1.9 Dependence on Interactions
4.1.3.1.10 Summary of Fault Detection Effectiveness
4.1.3.2 Fault Detection Cost
4.1.3.2.1 Data Distributions
4.1.3.2.2 Fault Detection Rate and Total Time
4.1.3.2.3 Dependence on Software Type
4.1.3.2.4 Computer Costs
4.1.3.2.5 Dependence on Programmer Expertise
4.1.3.2.6 Dependence on Interactions
4.1.3.2.7 Relationships Between Fault Detection Effectiveness and Cost
4.1.3.2.8 Summary of Fault Detection Cost
4.1.3.3 Characterization of Faults Detected
4.1.3.3.1 Omission vs. Commission Classification
4.1.3.3.2 Six-Part Fault Classification
4.1.3.3.3 Observable Fault Classification
4.1.3.3.4 Summary of Characterization of Faults Detected
4.1.4 Conclusions
4.2 Cleanroom Development Approach Analysis
4.2.1 Cleanroom Software Development Method
4.2.1.1 Investigation Goals
4.2.2 Empirical Study Using Cleanroom
4.2.2.1 Case Study Description
4.2.2.2 Operational Testing of Projects
4.2.3 Data Analysis and Interpretation
4.2.3.1 Characterization of the Effect on the Product Developed
4.2.3.1.1 Operational System Properties
4.2.3.1.2 Static System Properties
4.2.3.1.3 Contribution of Programmer Background
4.2.3.1.4 Summary of the Effect on the Product Developed
4.2.3.2 Characterization of the Effect on the Development Process
4.2.3.2.1 Summary of the Effect on the Development Process
4.2.3.3 Characterization of the Effect on the Developers
4.2.3.3.1 Summary of the Effect on the Developers
4.2.3.4 Distinction Among Teams
4.2.4 Conclusions
4.3 Characteristic Metric Set Study
4.3.1 Characteristic Software Metric Sets
4.3.1.1 Investigation Goals
4.3.2 Empirical Study
4.3.2.1 SEL Environment
4.3.2.2 Effort, Change, and Fault Data
4.3.3 Data Analysis
4.3.3.1 Approach for Set Calculation
4.3.3.1.1 An Alternate Approach
4.3.3.2 Application in the SEL Environment
4.3.3.3 Use as a Management Tool
4.3.3.3.1 Conditional Probabilities from Historical Data
4.3.3.3.2 Data Interpretation
4.3.4 Conclusions
5 Conclusions
5.1 Overall Results from the Software Technology Evaluations
5.2 Problem Areas
5.3 Overall Conclusions
6 Appendices
6.1 Appendix A. Overview of Sampling and Statistical Test Application
6.2 Appendix B. Programs Used in the Testing Strategy Comparison
6.2.1 Appendix B.1. The Specifications for the Programs
6.2.2 Appendix B.2. The Source Code for the Programs
6.3 Appendix C. Operational Testing Procedure Applied in the Cleanroom Study
6.3.1 Test Data Selection
6.3.2 Testing Process and Failure Observation
6.3.3 Failure Counting
7 References


List  of  Figures 


Figure 1. Goal/question/metric paradigm.

Figure 2. Categorization of analyses of software.

Figure 3. Three analyses selected.

Figure 4. Capabilities of the testing methods.

Figure 5. Structure of goals/subgoals/questions for testing experiment.

Figure 6. Expertise levels of subjects.

Figure 7. The programs tested.

Figure 8. Programs tested in each phase of the analysis.

Figure 9. Distribution of faults in the programs.

Figure 10. Fault classification and manifestation.

Figure 11. Fractional Factorial Design.

Figure 12. Overall summary of detection effectiveness data.

Figure 13. Distribution of the number of faults detected broken down by phase.

Figure 14. Overall summary for number of faults detected.

Figure 15. Overall summary of fault detection cost data.

Figure 16. Distribution of the fault detection rate (# faults detected per
hour) broken down by phase.

Figure 17. Overall summary for fault detection rate (# faults detected per
hour).

Figure 18. Characterization of the faults detected.

Figure 19. Characterization of the faults observable, but not reported.

Figure 20. Framework of goals and questions for Cleanroom development
approach analysis.

Figure 21. Subjects' professional experience in years.

Figure 22. System statistics.

Figure 23. Requirement conformance of the systems.

Figure 24. Percentage of successful test cases during operational testing
(without duplicate failures).

Figure 25. Breakdown of responses to the attitude survey question, "Did you
feel that you and your team members effectively used off-line review
techniques in testing your project?".

Figure 26. Connect time in hours during project development.

Figure 27. Number of system releases.

Figure 28. Breakdown of responses to the attitude survey question, "Did you
miss the satisfaction of executing your own programs?".

Figure 29. Relationship of program size vs. missing program execution.

Figure 30. Breakdown of responses to the attitude survey question, "How was
your design and coding style affected by not being able to test and debug?".

Figure 31. Breakdown of responses to the attitude survey question, "Would you
use Cleanroom again?".

Figure 32. Summary of measure averages and significance levels.

Figure 33. Framework of goals and questions for characteristic set study.

Figure 34. List of measures examined in the SEL environment.

Figure 35. Conditional probabilities based on SEL data: upper quartiles of
dependent variables.

Figure 36. Conditional probabilities based on SEL data: lower quartiles of
dependent variables.

Figure 37. Regular expression of logical inputs to the system in a single user
session.

Figure 38. Schedule of Deliveries for a Sample Team.

Figure 39. Two Testing Schedules for a Sample Team.

Figure 40. Arc Frequency Assignment as a Result of Stratification.

Figure 41. Failure Counting Issues.


1.  Introduction 


Computer science is both a theoretical science and a practical science. A
lot of work has been done studying theoretical aspects of computer science:
determining optimum algorithms, formulating mathematical models, proving
theorems, etc. However, little work has been done studying the practice of
computer science - studying how the discipline of computer science is actually
applied.

There are several motivations for studying the practice of computer science.
Programs in practice are different from those in theory. The programs
developed, maintained, and managed in practice tend to be large, unwieldy, and
complex. Almost everyone associated with computer science has had an
experience where he/she has said, "Wait a minute, that did not turn out the
way that I thought it would!". Although there are insights into how the theory
applies in practice, these insights have not always been correct. In the
practice of computer science, few objects are viewed in isolation; there is a
complex interaction among the programmer, methodology-tool-technique, and
computer. For example, consider the area of software testing. The process of
software testing has existed a long time. Testing is the most common way to
attempt to show that a program does what it is intended to do. Several
theoretical results have been published in the area of software testing. Yet,
what is the best way to test a program - use a functional testing approach, a
structural approach, or a nonexecution-based reading process? The challenge is
that the best approach is not known. How is such a question answered?

The overall objective of this dissertation is to examine factors that
contribute to software development and maintenance. The investigations
undertaken adhere to two major themes. First, the factors studied should have
a high potential benefit to the process of attaining aspects of software
quality: requirement conformance, operational reliability, and modifiable
source code. Second, the investigations should capture the effect of the
factors precisely by characterizing and evaluating them quantitatively.

The three analyses presented are studies of 1) software testing, 2) Cleanroom
software development (which will be described later), and 3) software
metrics. The three studies are intended to advance the understanding of 1) the
contribution of various software testing strategies to the software
development process and to one another; 2) the relationship between
introducing discipline into the development process (as in the Cleanroom
approach) and several aspects of product quality (requirement conformance,
operational reliability, and modifiable source code); and 3) the use of
software metrics to characterize software environments and to predict project
outcome.

The evaluation of software technologies has suffered because of the lack of
quantitative assessment of their effect on software development and
modification. This dissertation describes a seven-step analysis methodology
that is intended to structure the process of evaluating software technologies.
The analysis methodology provides a paradigm that is applicable in a variety
of problem domains and is used in-depth in the three studies presented.


Section 2 describes an approach for quantitatively evaluating software
technologies and classifies previous studies of software. Section 3 discusses
the selection of the three investigations conducted. The problem formulation,
data analysis, and results from the three studies are presented in Section 4.
Section 5 summarizes the conclusions from this work.


2.  A  Quantitative  Approach  for  Evaluating  Software  Technologies 


Several techniques and ideas have been proposed to improve the software
development process and the delivered product. There is little hard evidence,
however, of which methods actually contribute to quality in software
development and modification. As a consequence, many management decisions and
research issues are resolved by inexact means and seasoned judgment, without
the support of appropriate data and analysis. As the software field emerges,
the need for understanding the important factors in software production
continues to grow. The evaluation of software technologies suffers because of
the lack of quantitative assessment of their effect on software development
and modification.

This dissertation supports the philosophy of coupling methodology with
measurement, that is, tying the processes of software methodology use and
evaluation together with software measurement. The assessment of factors that
affect software development and modification is then grounded in appropriate
measurement, data analysis, and result interpretation. This section describes
a quantitatively based approach to evaluating software technologies. The
formulation of problem statements in terms of goal/question hierarchies is
linked with measurable attributes and quantitative analysis methods. These
frameworks of goals and questions are intended to outline the potential effect
a technology has on aspects of software cost and quality. Problem formulation
linked with the collection and analysis of appropriate data is pivotal to any
management, control, or quality improvement process.

The analysis methodology described provides a framework for data collection,
analysis, and quantitative evaluation of software technologies. The paradigm
identifies the aspects of a well-run analysis and is intended to be applied
in different types of problem analysis from a variety of problem domains. The
methodology presented not only serves as a problem formulation and analysis
paradigm, but also suggests a scheme to characterize analyses of software
development and modification. The use of the paradigm highlights several
problem areas of data collection and analysis in software research and
management.

The  approach  described  for  quantitative  evaluation  of  software  technologies 
1)  applies  a  seven-step  methodology  for  data  collection  and  analysis,  2)  couples 
problem  formulation  with  quantitative  analysis  methods,  and  3)  suggests  an 
analysis  classification  scheme.  The  following  sections  describe  these  aspects  of 
the  approach. 

2.1.  Methodology  for  Data  Collection  and  Analysis 

The methodology described for data collection and analysis has been quite
useful. The methodology consists of seven steps that are listed below and
discussed in detail in the following paragraphs (see also [Basili & Weiss
84]). 1) Formulate the goals of the data collection and analysis. 2) Develop a
list of specific questions of interest. 3) Establish appropriate metrics and
data categories. 4) Plan the layout of the investigation, experimental design,
and statistical analysis. 5) Design and test the data collection scheme. 6)
Perform the investigation concurrently with data collection and validation. 7)
Analyze and interpret the data in terms of the goal/question framework.

A first step in a management or research process is to define a set of goals.
Each goal is then refined into a set of sub-goals that will contribute to
reaching that goal. This refinement process continues until specific research
questions and hypotheses have been formulated. Associated with each question
are the data categories and particular metrics that will be needed in order to
answer that question. The integration of these first three steps in a
goal/question/metric hierarchy (see Figure 1) expresses the purpose of an
analysis, defines the data that need to be collected, and provides a context
in which to interpret the data.

Figure 1. Goal/question/metric paradigm.

[Figure: a hierarchy with three levels labeled Goals, Questions, and Metrics.]
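The goal/question/metric hierarchy described above can be sketched as a small data structure. This is only an illustrative sketch, not the dissertation's own tooling: the class names `Goal` and `Question`, the method `required_metrics`, and all of the example goals, questions, and metric names are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """A specific question of interest and the metrics needed to answer it."""
    text: str
    metrics: list[str] = field(default_factory=list)


@dataclass
class Goal:
    """A goal refined into sub-goals and, eventually, into questions."""
    text: str
    subgoals: list["Goal"] = field(default_factory=list)
    questions: list[Question] = field(default_factory=list)

    def required_metrics(self) -> list[str]:
        """Collect, without duplicates, every metric reachable from this goal."""
        found: list[str] = []
        for question in self.questions:
            for metric in question.metrics:
                if metric not in found:
                    found.append(metric)
        for subgoal in self.subgoals:
            for metric in subgoal.required_metrics():
                if metric not in found:
                    found.append(metric)
        return found


# Hypothetical example content, loosely modeled on a testing comparison.
effectiveness = Goal(
    "Characterize fault detection effectiveness",
    questions=[
        Question("Which technique detects the most faults?",
                 ["number of faults detected"]),
        Question("Which technique detects the greatest percentage of faults?",
                 ["number of faults detected", "number of faults in program"]),
    ],
)
cost = Goal(
    "Characterize fault detection cost",
    questions=[
        Question("Which technique has the highest fault detection rate?",
                 ["number of faults detected", "detection time in hours"]),
    ],
)
top = Goal("Compare software testing strategies", subgoals=[effectiveness, cost])

print(top.required_metrics())
```

Walking the hierarchy from the top goal yields the list of data to collect, which mirrors the point that the hierarchy defines the data needed before any collection begins.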

In order to address these research questions, investigators undertake several
types of analyses. Through these analyses, they attempt to increase
substantially their knowledge and understanding of the various aspects of the
questions. The analysis process is then the basis for resolving the research
questions and for pursuing the various goals. Before actually collecting the
data, the investigators plan the data analysis techniques to be used. The
appropriate analysis methods may require an alternate layout of the
investigation or additional pieces of data to be collected. A well planned
investigation facilitates the interpretation of the data and generally
increases the usefulness of the results.

Once it is determined which data should be gathered, the investigators
design and test the collection method. They determine the information that
can be automatically monitored, and customize the data collection scheme to
the particular environment. The several types of data that need to be
collected usually require a data collection plan balanced across collection
forms, automated measurement, and personnel interviews. After all the planning
has occurred, the data collection is performed concurrently with the
investigation and is accompanied by suitable data validity checks.

As soon as the data have been validated, the investigators do preliminary
data analysis and screening using scatter plots and histograms. After
fulfilling the proper assumptions, they apply the appropriate statistical and
analytical methods. The statistical results are then organized and interpreted
with respect to the goal/question framework. More information is gathered as
the analysis process continues, with the goals being updated and the whole
cycle progressing.

2.2.  Coupling  Goals  With  Analysis  Methods 

Several of the steps in the above data collection and analysis methodology
interrelate with one another. The structure of the goals and questions should
be coupled with the methods proposed to analyze the data. The particular
questions should be formulated to be easily supported by analysis techniques.
In addition, questions should consider attributes that are measurable. Most
analyses make some result statement (or set of statements) with a given
precision about the effect of a factor over a certain domain of objects.
Considering the form of analysis result statements will assist the formation
of goals and questions for an investigation, and will make the statistical
results more readily correspond to the goals and questions.

2.2.1.  Forms  of  Result  Statements 

Consider a question in an investigation phrased as "For objects in the
domain D, does factor F have effect S?". The corresponding result statement
could be "Analysis A showed that for objects in the domain D, factor F had
effect S with certainty P.". In particular, a question could read "For novice
programmers doing unit testing, does functional testing uncover more faults
than does structural testing?". An appropriate response from an analysis may
then be "In a blocked subject-project study of novice programmers doing unit
testing, functional testing uncovered more faults than did structural testing
(α < .05).".
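The general form of a result statement (analysis A, domain D, factor F, effect S, certainty P) can be captured in a small record type. The sketch below is an illustration only: the class name `ResultStatement`, its field names, and the choice to express certainty as a significance level `alpha` are assumptions made here, and the example values simply restate the sample statement from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultStatement:
    """One result statement of the general form described in the text."""
    analysis: str   # A: the analysis that produced the result
    domain: str     # D: the objects the statement covers
    factor: str     # F: the factor whose effect was examined
    effect: str     # S: the effect attributed to the factor
    alpha: float    # assumed stand-in for "certainty P" (significance level)

    def __post_init__(self) -> None:
        # A significance level outside (0, 1) cannot be interpreted.
        if not 0.0 < self.alpha < 1.0:
            raise ValueError("alpha must lie strictly between 0 and 1")

    def render(self) -> str:
        """Phrase the statement as one sentence."""
        return (f"{self.analysis} showed that for {self.domain}, "
                f"{self.factor} {self.effect} (alpha < {self.alpha:g}).")


statement = ResultStatement(
    analysis="A blocked subject-project study",
    domain="novice programmers doing unit testing",
    factor="functional testing",
    effect="uncovered more faults than did structural testing",
    alpha=0.05,
)
print(statement.render())
```

Keeping the domain and the significance level as explicit fields makes it harder to quote an effect without the qualifications that the analysis actually supports.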

Result statements on the effects of factors have varying strengths, but
usually are either characteristic, evaluative, predictive, or directive.
Characteristic statements are the weakest. They describe how the objects in
the domain have changed as a result of the factor. E.g., "A blocked
subject-project study of novice programmers doing unit testing showed that
using code reading detected and removed more logic faults than computation
faults (α < .05)." Evaluative statements associate the changes in the objects
with a value, usually on some scale of goodness or improvement. E.g., "A
blocked subject-project study of novice programmers doing unit testing showed
that using code reading detected and removed more of the faults that are
expensive to correct than did functional testing (α < .05)." Predictive
statements are a stronger statement type. They describe how objects in the
domain will change if subjected to a factor. E.g., "A blocked subject-project
study showed that for novice programmers doing unit testing, the use of code
reading will detect and remove more logic faults than computation faults
(α < .05)." Directive statements are the strongest type. They foretell the
value of the effect of applying a factor to objects in the domain. E.g., "A
blocked subject-project study showed that for novice programmers doing unit
testing, the use of code reading will detect and remove more of the faults
that are expensive to correct than will functional testing (α < .05)." The
analysis process then consists of an investigative procedure to achieve the
result statements of the desired strength and precision after considering the
nature of the factors and domains involved.
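One way to read the four statement types is as the combination of two yes/no properties: whether the statement foretells a future effect rather than describing an observed one, and whether it attaches a value to the effect. The sketch below encodes that reading. The two-property decomposition and the names `Strength` and `statement_strength` are an interpretation introduced here, not terminology from the dissertation.

```python
from enum import Enum


class Strength(Enum):
    """Statement types, numbered from weakest (1) to strongest (4)."""
    CHARACTERISTIC = 1
    EVALUATIVE = 2
    PREDICTIVE = 3
    DIRECTIVE = 4


def statement_strength(foretells: bool, assigns_value: bool) -> Strength:
    """Classify a result statement from two assumed yes/no properties.

    foretells: the statement says what will happen, not only what happened.
    assigns_value: the statement places the effect on a scale of goodness.
    """
    if foretells:
        return Strength.DIRECTIVE if assigns_value else Strength.PREDICTIVE
    return Strength.EVALUATIVE if assigns_value else Strength.CHARACTERISTIC


# "Code reading detected more logic faults than computation faults":
# describes an observed change without attaching a value to it.
print(statement_strength(foretells=False, assigns_value=False).name)

# "Code reading will detect more of the faults that are expensive to
# correct": foretells an effect and attaches a value to it.
print(statement_strength(foretells=True, assigns_value=True).name)
```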

Given any factor, researchers would like to make as strong a statement with
as high a precision about its effect in as large a domain as possible.
Unfortunately, as the statement applies to an increasingly large domain, the
strength of the statement or the precision with which we can make it may
decrease. In order for analyses to produce useful statements about factors in
large domains, the particular aspects of a factor and the domains of its
application must be well understood and incorporated into the investigative
scheme.

2.3.  Analysis  Classification  Scheme 

Two important sub-domains that should be considered in the analysis of
factors in software development and modification are the individuals applying
the technology and what they are applying it to. These two sub-domains will
loosely be referred to as the "subjects," a collection of (possibly
multi-person) teams engaged in separate development efforts, and the
"projects," a collection of separate problems or pieces of software to which a
technology is applied. A general classification of several software analyses
in the literature can be obtained by examining the sizes of these two
sub-domains that they consider.


Figure 2. Categorization of analyses of software.

                                        # projects
                              one                   more than one
                        +--------------------+--------------------------+
  # teams     one       | single project     | multi-project variation  |
  per                   +--------------------+--------------------------+
  project     more than | replicated project | blocked subject-project  |
              one       |                    |                          |
                        +--------------------+--------------------------+

Figure 2 presents this four part analysis categorization scheme. Blocked
subject-project studies examine the effect of possibly several technologies as
they are applied by a set of subjects on a set of projects. If appropriately
configured, this type of study enables comparison within the groups of
technologies, subjects, and projects. In replicated project studies, a set of
subjects may separately apply a technology (or maybe a set of technologies) to
the same project or problem. Analyses of this type allow for comparison within
the groups of subjects and technologies (if more than one used). A
multi-project variation study examines the effect of one technology (or maybe
a set of technologies) as applied by the same subject across several projects.
These analyses support the comparison within groups of projects and
technologies (if more than one used). A single project analysis involves the
examination of one subject applying a technology on a single project. The
analysis must partition the aspects within the particular project, technology,
or subject for comparison purposes.
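The four-part scheme reduces to a lookup on two counts: how many teams work on each project and how many projects are involved. The following sketch shows that mapping; the function name `classify_analysis` and the example counts are hypothetical and serve only to illustrate the categories named in the text.

```python
def classify_analysis(teams_per_project: int, projects: int) -> str:
    """Return the analysis category for the given study dimensions."""
    if teams_per_project < 1 or projects < 1:
        raise ValueError("a study needs at least one team and one project")
    if teams_per_project == 1:
        # One subject: the study is either confined to one project or
        # follows the same subject across several projects.
        return "single project" if projects == 1 else "multi-project variation"
    # Several subjects: they either replicate one project or are blocked
    # across a set of projects.
    return "replicated project" if projects == 1 else "blocked subject-project"


# Hypothetical study shapes, one per category.
for teams, projects in [(1, 1), (1, 6), (15, 1), (72, 3)]:
    print(teams, projects, classify_analysis(teams, projects))
```

Under this mapping, a study in which many teams separately build the same system would fall into the replicated project category, while one in which many subjects each work on several programs would be a blocked subject-project study.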


Result statements of all four types mentioned above can be derived from
all these analysis classes. However, the statements will need to be qualified by
the domain from which they were obtained. Thus, the larger the sampled domain
and the more representative it is of other populations, the wider-reaching the
conclusion.

The next section cites several software analyses from the literature and
classifies them according to the scheme pictured in Figure 2.


2.4.  Classification  of  Analyses  of  Software 


Several investigators have published studies in the four general areas of
blocked subject-project, replicated project, multi-project variation, or single
project. The following sections cite analyses of the software development
process and product from each of these categories. Note that surveys on
experimental methodology in empirical studies have appeared in the literature
[Brooks 80, Sheil 81, Moher & Schneider 82].

2.4.1.  Blocked  Subject-Project  Studies 

[Curtis et al. 79] describe two experiments investigating factors that
influence two aspects of software maintenance: understanding existing programs
and accurately implementing modifications to them. The analyses involved the
performance of 72 programmers operating on several versions of programs in
three general software classes. The factors examined include control flow
complexity, variable name mnemonicity, type of modification, degree of
commenting, and the relation of programmer performance to various complexity
metrics. They continued the investigation of how software characteristics
relate to psychological complexity in [Curtis, Sheppard & Milliman 79]. This
second paper describes a third experiment monitoring the ability of 54
programmers to detect different program bugs in distinct program versions.

[Hetzel 78] conducted a controlled experiment comparing different software
testing techniques. The methods of functional testing, code reading, and a
control group (both capabilities) were applied by 39 subjects to three
different programs. In addition to describing technique performance, the
testing strategies were related to factors of programmer background, self
estimates of performance, and attitude.

[Miara et al. 83] describe a study to determine the contribution of
indentation to program comprehensibility. The experimental approach examined
the factors of level and type of indentation, as well as level of programmer
experience. The understanding of seven different program variations was
obtained by a comprehension quiz and a subjective rating of how difficult the
program was to comprehend.

[Weissman 74] described several experiments conducted to measure a
subject's ability to understand a program and his/her ability to modify it. The
four areas of factors examined included aspects of program form, control flow,
data flow, and interaction between control and data flow. The number of
subjects studied ranged from 10 to 48. Each experiment used two different
programs presented in varying combinations of the above factors. The
measurements of understanding included self-evaluations, fill-in-the-blank
quizzes, program hand simulation, ability to modify the program, and
comprehension quizzes. The experiments were conducted sequentially to support
the refinement of an appropriate experimental methodology.

[Gould & Drongowski 74] examined several factors related to computer
program debugging: effect of debugging aids, effect of fault type, and effect
of the particular program debugged. Thirty experienced programmers separately
debugged programs that contained a single fault. Three classes of faults in
four different one-page programs were used. Learning effects were examined and
some possible "principles" of debugging were identified. Consistent results
were obtained when the study was conducted on ten additional experienced
programmers [Gould 75].

[Gannon & Horning 75] investigated the factor of language design and its
relation to the reliability of the resulting software. Nine different language
modifications were made to a programming language based on an analysis of its
deficiencies. Two subject groups with different levels of experience completed
implementations of two small but sophisticated programs (75 - 200 lines) in the
original language and in its modified version. The performance of the
redesigned features in the two languages was contrasted in the frequency,
type, and persistence of faults in the programs written by the subjects.

[Soloway & Ehrlich 84] examined two aspects of programming knowledge:
programming plans and rules of programming discourse. Programming plans
are generic program fragments that represent stereotypic action sequences in
programming. Rules of programming discourse capture conventions in
programming and govern the composition of the plans into programs. A total of
139 subjects participated in an experiment that required them to fill in the
blanks in programs selected from four different software types. Some of the
programs were written to violate certain hypothesized programming plans and
discourse rules. A second similar study involving 41 professional programmers
was conducted. The results in general support the existence and use of such
plans and rules by both novice and advanced programmers.


Other blocked subject-project studies include [Panzl 81, Woodfield,
Dunsmore & Shen 81].

2.4.2.  Replicated  Project  Studies 

[Basili & Reiter 81] present a study in which three different software
development approaches are analyzed and compared. Seven three-person teams used
a disciplined approach, six teams used an ad hoc approach, and six individuals
used an ad hoc approach. Each of the 19 separate development efforts
implemented a 1200-line compiler project. This allowed a comparison among the
different development approaches, as well as of the usability of various
metrics for process measurement. A primary motivation for the experiment was to
confirm certain beliefs about the beneficial effects of a particular
disciplined methodology for software development. The researchers examined the
factor of development technique and showed partial support for some of the
beliefs by capturing several objective and automatable metrics of the
development process and product.

[Johnson, Draper & Soloway 83] describe one of several studies done with
the intention of characterizing misconceptions made by programmers and how
they are manifested as bugs in programs. In this work they inspected the
attempted implementations of an elementary problem by 204 novice programmers.
They then classified the differences between the incorrect "buggy" programs
and correct versions of similar structure. The differences were explained
relative to mistakes in "programming plans" intended by the individual
[Soloway et al. 82]. Further work comparing the factor of design strategies of
novice and expert programmers is underway [Soloway 83].

[Bailey 84] presents preliminary observations from an experiment in which
the Ada programming language was taught in two different fashions.¹ One class
of subjects was first taught the high-level concepts supported in the language,
such as modular design and data abstraction, and then taught the actual
constructs and syntax of the language. The same material was presented in the
reverse order to a second class of subjects; that is, constructs first and
concepts second. Thus the factor studied was the order of presentation of
material. In addition to some preliminary exercises and academic scores, the
two groups were compared on their ability to apply Ada in the design (only) of
a small software system.

[Knight 84] examined the possibility of building ultra-reliable software
systems by using N-version programming. The technique of N-version programming
[Kelly 82] uses a high-level driver to connect several separately designed
versions of the same system. The systems then "vote" on the correct solution,
and the solution provided by the majority of the systems is output. The study
examined 37 separately designed versions of the same 800 source line system.
The factors examined included individual system reliability, total N-version
system reliability, and classes of faults that occurred in systems
simultaneously.
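
The voting step performed by the high-level driver can be sketched as follows. This is a minimal illustration under stated assumptions, not the driver used in [Knight 84]: the function name `majority_vote` is hypothetical, outputs are assumed to be directly comparable values, and returning `None` when no strict majority exists is an assumption, since the source does not say how that case is handled.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output produced by a strict majority of the versions.

    Returns None when no single output is produced by more than half of
    the versions (an assumed policy; the source does not specify one).
    """
    if not outputs:
        return None
    # most_common(1) yields the single most frequent output and its count.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None
```

With three versions, `majority_vote([3, 3, 4])` selects the value agreed on by two of them, while `majority_vote([1, 2, 3])` reports that no majority exists.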

[Gannon 77] investigated the factor of static typing in programming
languages. Two languages were generated which were essentially equivalent
except for differences in the type conventions: one was statically typed (with
integer and string types) and the other typeless (e.g., arbitrary subscripting
of memory). A group of 38 subjects programmed the same problem in both
languages, with half doing it in each order. The two languages were compared
in the types of faults in the resulting programs, the number of runs containing
faults, and the relation of subject experience to fault proneness.

¹ Ada is a trademark of the U.S. Dept. of Defense.

[Shneiderman et al. 77] examined the factor of detailed flowcharts as an aid
to program composition, comprehension, debugging, and modification. A series
of five experiments was conducted on groups of 53 to 70 subjects of novice to
intermediate expertise. A single program was used in all experiments except
one, which used two programs. All experiments compared the performance in
various programming tasks between groups that used some form of flowchart and
those that did not. The performance of the groups was measured by
comprehension quiz scores, correctness of programs written, correctness of
modifications requested, and successful removal of seeded faults. No significant
differences were reported between groups that used and those that did not use
flowcharts.

[Parnas 72a] investigated the factor of proper system modularity as a
means to eliminate the "integration phase" in development. A given system
was decomposed into five modules, and four different types of implementation
were specified for each module. Twenty subjects then independently developed
the distinct implementations for the particular modules. At project completion,
numerous combinations of the modules were assembled to form separate versions
of the whole system. The minor effort required in assembling the systems
evidenced support for the ideas on formal specifications and modularity
discussed in [Parnas 72b, Parnas 72c].

[Boehm et al. 84] investigated the system development approaches of
prototyping and specifying. Seven teams developed versions of the same
application software system (2000 - 4000 lines); four teams used a
requirement/design specification approach and three teams used a prototyping
approach. The final prototyped products were smaller, required less development
effort, and were easier to use. The systems developed by specifications had
more coherent designs, more complete functionality, and software that was
easier to integrate.

[Myers 78] examined the factor of program testing technique and its
relation to defect detection. The three techniques of 3-person walk-throughs,
functional testing, and a control group were compared in the testing of a small
(100 line) but nontrivial program. Fifty-nine data processing professionals
were used as subjects. The techniques and their random pairings were compared
in the number of faults found and their cost-effectiveness. The single
techniques were not different in the number of faults they detected, while
pairings of techniques were superior in terms of number of faults found.

Other replicated project studies include [Buck 81, Hwang 81, Hutchens &
Basili 83].


2.4.3.  Multi-Project  Variation  Studies 


[Walston & Felix 77, Bailey & Basili 81, Basili & Freburger 81, Brooks 81,
Basili, Selby & Phillips 83, Vosburgh et al. 84] are some of the numerous
studies that have examined technological factors across several projects. The
studies that consider separate development efforts coming from a single team or
homogeneous environment genuinely belong in this category. Those that
considered projects from a collection of heterogeneous teams or environments,
such as [Vosburgh et al. 84], are placed here because they examined the
differences in effect of the factors, but not the teams or environments. The
factors investigated include structured programming, personnel background,
development process and product constraints, project complexity, human and
computer resource consumption, project duration, staff size, degree of
management control, and productivity. In particular, [Bailey & Basili 81]
mention 82 factors that could possibly affect project performance, including 36
from [Walston & Felix 77] and 16 from [Boehm 81]. They then describe a model
generation process that uses a base-line of particular environmental aspects
and captures differences among projects. The number of projects examined ranges
from 18 in [Bailey & Basili 81] to 51 in [Walston & Felix 77, Brooks 81]. Among
other results, these studies have led to increased project visibility, greater
understanding of classes of factors sensitive to project performance, awareness
of the need for project measurement, and efforts for standardization of
definitions. Analysis has begun on incorporating project variation information
into a management tool [Basili & Doerflinger 83].


[Bowen 84] examined the factors of estimating the number of residual
faults in a system and of assessing the effectiveness of various testing
stages. The study was based on fault data collected from three large (2000 -
6000 module) systems developed in the Hughes-Fullerton environment. The study
partitioned the faults based on severity and analyzed the differences in
estimates of remaining faults according to stage of testing.

[Adams 84] examined the factor of managing preventive service of software
products in operational use. Preventive service consists of installing fixes
to faults that have yet to be discovered by particular users, but have been
discovered by the vendor or other users. The study developed means to estimate
whether and under what circumstances preventively fixing faults in operational
software in the field was appropriate. The fault history for several large
products (e.g., operating system releases, major components thereof) was
empirically modeled.

[Vessey & Weber 83] examined the contribution of several factors to
software maintenance: program complexity, programming style, programmer
quality, and number of system releases. A total of 447 commercial and clerical
Cobol programs in operation in one Australian organization and in two U.S.
organizations were analyzed. The programs ranged from small to over 600
statements. In the Australian organization, program complexity and programming
style significantly affected the rate of maintenance repair. In the U.S.
organizations, the number of times a system was released significantly affected
the maintenance repair rate.


2.4.4.  Single  Project  Studies 


[Endres 75, Basili & Weiss 81, Albin & Ferreol 82, Ostrand & Weyuker 83,
Basili & Perricone 84] present the analysis of the distribution and
relationships derived from change data collected during the development of a
moderate to large software project. On a within-project basis, they examined
such factors as the frequency and distribution of faults during development,
and their relationship with the factors of module size, software complexity,
developer's experience, method of detection and isolation, phase of entrance
into the system and observance, reuse of existing design and code, and role of
the requirements document. Although conducted on only a single project, such
analyses have produced fault categorization schemes and have been useful in
understanding and improving a development environment.

[Gannon et al. 83, Basili et al. 85] examined a ground-support system
written in Ada to characterize the use of Ada packages. Factors such as how
package use affected the ease of system modification and how to measure module
change resistance were identified, as well as how these observations related to
aspects of the development and training.

[Basili & Ramsey 84, Ramsey 84] investigated the structural coverage of
functionally generated input data. The functionally generated acceptance test
cases and a sample of operational usage cases were analyzed from a
medium-sized (10,000 line) satellite support system. The study examined the
factors of the structural coverage of functional acceptance testing, the
structural coverage of operational product usage, the relationship between the
program segments covered in acceptance testing and those covered in usage, and
the relationship between structural coverage and fault detection.

[Baker 72a] analyzed the effect of applying chief programmer teams and
structured programming in system development. The large (83,000 line) system
discussed is known as The New York Times Project. The project served as a
field test for the new programming methodology concepts of structured code,
top-down design, chief programmer teams, and program libraries. Several
benefits were identified, including reduced development time and cost, reduced
time in system integration, and reduced fault detection in acceptance testing.


3.  Evaluation  of  Software  Technologies:  Problem  Selection 

The approach described in the previous section is intended to structure the
process of analyzing software technologies by coupling software methodology
evaluation with software measurement. The paradigm is applied in-depth in
three different analyses. In addition to evaluating software technologies, the
studies demonstrate the feasibility, utility, and effectiveness of the
quantitative analysis paradigm. This section describes the selection of the
different investigations conducted. The selection criteria for all the studies
are discussed first, followed by an overview of each of the studies and how
they apply the paradigm.

3.1.  Selection  Criteria 

Three  different  studies  were  chosen  to  satisfy  several  criteria:  scope  of 
evaluation,  domain  sampling,  quantitative  analysis  method,  area  of  assessment, 
scope  of  technology,  and  potential  benefit. 

1) Scope of Evaluation - Each of the analyses should be a distinct type of
study relative to the categorization of blocked subject-project, replicated
project, multi-project variation, and single project. These different classes
represent different pairings of the domain sizes of subjects and projects. Using
the paradigm in these different categories shows its support for analysis of
technologies across different scopes of evaluation.

2) Domain Sampling - The samples chosen from the subject and project
domains in the studies should be representative of reasonably large populations.
Assessments using different sizes of software projects and different sizes of
teams should be chosen. The selection and analysis of appropriate samples
facilitates the extrapolation of the results to other environments, increases
the usefulness of the results, and shows the performance of the paradigm in
such situations.

3) Quantitative Analysis Method - Each of the studies should utilize a
different method of quantitative analysis. Statistical techniques provide a
soundly based, objective, and usually automatable mechanism to accomplish
quantitative analysis tasks. Unfortunately, however, the amount of data
required by some statistical approaches leaves them economically infeasible.
Even with sufficient data, the generated results may yield unacceptable
precision, too much unexplained variance, or doubt as to whether all the
important factors are effectively captured. The use of this evaluation paradigm
with a variety of quantitative methods demonstrates the flexibility of the
approach across varying amounts and types of data.

4) Area of Assessment - The different problem areas investigated should
not be precisely understood areas of software development and modification.
The areas chosen should have open questions and unresolved issues. Selecting
problems with these attributes provides a scenario similar to decision making
situations in the field, where the proper outcome of the analysis is not known
beforehand.

5) Scope of Technology - The analyses should examine technologies that
have distinct scopes of usage during software development and modification.
Three different scopes of usage for technologies are a) individual technique: a
single technique used in conjunction with other techniques during a software
project; b) development methodology: a system of methods that applies across
the whole software project development; and c) environment methodology: a
system of methods that applies across several projects in a development or
modification environment. Using the paradigm in these categories demonstrates
its effectiveness for evaluation of technologies having varying scopes of usage.

6) Potential Benefit - The analyses should address factors that can
significantly contribute to the quality of the software development process and
the developed product. The need for analysis of which factors contribute to
quality in software development and modification is fundamental to the
advancement of the field. The production of useful results from the use of the
approach helps demonstrate its merit.

3.2.  Analysis  Selection 

From the above set of criteria, three analyses were selected: 1) a
comparison of software testing strategies; 2) an analysis of Cleanroom software
development; and 3) a calculation of a characteristic software metric set.
Figure 3 summarizes these studies relative to the criteria explained. Recall
the use of the symbols from the previous section describing the quantitative
methodology: D for domain sampled, F for factor or technology analyzed, and S
for result statement type. As displayed in the figure, the above set of
criteria is satisfied by the particular analyses selected. The following three
sections discuss the application of the approach in the particular studies.


Figure 3. Three analyses selected.

                        Testing Study         Cleanroom Study       Characteristic Set

Scope of Evaluation     blocked subject-      replicated project    multi-project
                        project                                     variation

Subject Domains
Sampled (D1)
  Number                74 individuals        15 small teams        1 medium
                                              (3-person)            environment
                                                                    (23-person)
  Expertise             junior - advanced     junior -              junior - advanced
                                              intermediate

Project Domains
Sampled (D2)
  Number                (illegible)           1                     6
  Size                  unit                  small system          large system
                        (100 - 350 LOC)       (1200 LOC)            (51,000 - 112,000
                                                                    LOC)

Quantitative            fractional            non-parametric        factor analysis
Analysis Method         factorial design      statistics

Area of Assessment      defect detection      project development   project management

Scope of                individual            development           environment
Technologies            technique             methodology           methodology

Factors (F)             code reading,         Cleanroom             SEL environment
                        functional testing,   development,
                        structural testing    traditional
                                              development

Potential Benefit       increase              increase product      better project
                        effectiveness of      quality and           monitoring and
                        defect detection      process control       control

Result                  characteristic,       characteristic,       characteristic,
Statements (S)          evaluative,           evaluative,           evaluative,
                        predictive            predictive            predictive


ware types. The individuals (domain D1) were selected from the populations of
junior, intermediate, and advanced programmers, and the programs tested
(domain D2) were of unit size and were selected from four populations of
software types. The programs had a distribution of faults that commonly occur
in software. A series of fractional factorial designs was employed in the
analysis. Software testing and defect detection are inexact and not very well
understood areas of software production. Yet the activities of testing and
defect detection are essential to the success of a software project. The three
individual techniques (factors F) examined were code reading, functional
testing, and structural testing. Result statements (statement strengths S)
characterizing, evaluating, and predicting the effect of each of these
techniques are intended from the analysis. The major area of benefit from the
analysis will be increasing the effectiveness of software testing and defect
detection. The goals of this study are to contrast the strategies in three
different aspects of software testing: 1) fault detection effectiveness, 2)
fault detection cost, and 3) classes of faults detected.

3.2.2.  Cleanroom  Development  Approach  Analysis 

The Cleanroom development approach analysis is a replicated project study
in which 15 small teams (3-person) separately applied two different software
development methodologies to build versions of the same small message system:
ten teams applied Cleanroom, while five applied a more traditional approach.
The individuals (domain D1) were selected from the populations of junior and
intermediate programmers, and the system built (domain D2) was a small system
selected from the population of small systems of moderate complexity.
Non-parametric statistics were applied to contrast the performance of the two
development methodologies. The outcome of a software project is largely a
function of the development methodology used, and the software community is
uncertain which development approaches consistently produce a quality product.
The two development methodologies (factors F) examined were Cleanroom
software development and a traditional team methodology. The Cleanroom
software development approach is intended to produce highly reliable software
by integrating formal methods for specification and design, complete off-line
development, and statistically based testing. Result statements (statement
strengths S) characterizing, evaluating, and predicting the effect of the two
development methodologies relative to one another are intended from the
analysis. The major area of benefit from the analysis will be increasing
product quality and development process control. This study analyzes the effect
of Cleanroom, relative to a traditional approach, on the delivered product, the
software development process, and the developers.
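
The text above states only that non-parametric statistics were applied and does not name a specific test at this point. As one illustration of the kind of comparison involved, the sketch below computes the Mann-Whitney U statistic, a common non-parametric measure for contrasting two independent groups of unequal size. The choice of this test is an assumption, the function name `mann_whitney_u` is hypothetical, and the scores are invented for illustration; they are not data from the study.

```python
from typing import Sequence, Tuple


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Compute the Mann-Whitney U statistics (U_a, U_b) for two samples.

    U_a counts the pairs (x, y), with x from a and y from b, where x > y;
    each tied pair contributes one half. U_a + U_b equals len(a) * len(b).
    """
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


# Hypothetical team scores, invented purely for illustration.
cleanroom = [92, 88, 95, 90, 85, 97, 91, 89, 94, 93]  # ten teams
traditional = [80, 86, 78, 90, 84]                     # five teams
u_cleanroom, u_traditional = mann_whitney_u(cleanroom, traditional)
```

To judge significance, the smaller of the two U values would be compared against tabulated critical values for group sizes of ten and five. Because the statistic depends only on the ordering of the observations, it makes no assumption that the underlying measurements are normally distributed, which suits small samples such as fifteen teams.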

3.2.3.  Characteristic  Metric  Set  Study 

The characteristic metric set analysis is a multi-project variation study in
which one development environment applied its methodology to 6 software
projects. The environment (domain D1) was selected from the population of
production environments, and the projects developed (domain D2) were large
systems selected from the population of large, moderately complex software
systems. The quantitative analysis method used was factor analysis. The
management of software projects is a challenging and ill-defined task. Better
monitoring and control of software projects lead to more successful project
management, and possibly higher product requirement conformance and
reliability. The environment methodology (factor F) examined was the
environment methodology of a NASA Goddard production environment. Result
statements (statement strengths S) characterizing, evaluating, and predicting
the effect of the particular environment methodology on projects are intended
from the analysis. The major area of benefit from the analysis will be
increasing the ability to monitor and control software projects. The goals of
this study are to 1) develop an approach for customizing a characteristic
software metric set to a particular environment; 2) calculate the
characteristic metric set for a NASA Goddard environment; and 3) examine the
usability of this approach as a management tool.

3.3.  Methodology  Application 

The three analyses described above are intended to advance the
understanding of factors that contribute to quality in software development and
modification. The next section presents the in-depth analysis for each of the
studies, including the goal/question framework, appropriate software metrics,
data analysis, and results.




4.  Evaluation  of  Software  Technologies:  Analysis  and  Results 

The following sections present three studies in which the quantitative
methodology described earlier is applied: a blocked subject-project study
comparing software testing strategies, a replicated project study
characterizing the effect of using the Cleanroom software development
approach, and a multi-project variation study to determine a characteristic
set of software cost and quality metrics.

4.1.  Software  Testing  Strategy  Comparison 

The processes of software testing and defect detection continue to challenge
the software community. Even though the software testing and defect detection
activities are inexact and inadequately understood, they are crucial to the
success of a software project. The controlled study presented addresses the
uncertainty of how to test software effectively. In this investigation, common
testing techniques were applied to different types of software by subjects who
had a wide range of professional experience. This work is intended to
characterize how testing effectiveness relates to several factors: testing
technique, software type, fault type, tester experience, and any interactions
among these factors. This examination extends previous work by incorporating
different testing techniques and a greater number of persons and programs,
while broadening the scope of issues examined and adding statistical
significance to the conclusions.

This section describes the testing techniques examined, the investigation
goals, the experimental design, operation, analysis, and conclusions.


4.1.1.  Testing  Techniques 


To demonstrate that a particular program actually meets its specifications,
professional software developers currently utilize many different testing
methods. Before presenting the goals for the empirical study comparing the
popular techniques of code reading, functional testing, and structural
testing, a description will be given of the testing strategies and their
different capabilities (see Figure 4). In functional testing, which is a
"black box" approach [Howden 80], a programmer constructs test data from the
program's specification through methods such as equivalence partitioning and
boundary value analysis [Myers 79]. The programmer then executes the program
and contrasts its actual behavior with that indicated in the specification.
In structural testing, which is a "white box" approach [Howden 78, Howden 81],
a programmer inspects the source code and then devises and executes test cases
based on the percentage of the program's statements or expressions executed
(the "test set coverage") [Stucki 77]. The structural coverage criterion used
was 100% statement coverage. In code reading by stepwise abstraction, a person
identifies prime subprograms in the software, determines their functions, and
composes these functions to determine a function for the entire program
[Mills 72a, Linger, Mills & Witt 79]. The code reader then compares this
derived function and the specifications (the intended function). In order to
contrast these various strategies, an empirical study has been conducted using
the techniques of code reading, functional testing, and structural testing.


Figure 4. Capabilities of the testing methods.

                              code       functional   structural
                              reading    testing      testing

view program specification    X          X            X
view source code              X                       X
execute program                          X            X
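
The functional testing technique described above derives test data from the
specification alone. The following is a minimal sketch of equivalence
partitioning and boundary value analysis for a single numeric input. The
1-to-30 character word-length limit and all function names are hypothetical,
chosen only for illustration; they are not taken from the specifications used
in the study.

```python
def boundary_values(low, high):
    """Return inputs at and immediately around the edges of the valid range."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]


def equivalence_representatives(low, high):
    """Return one representative per equivalence class: below, inside, above."""
    return {
        "below_range": low - 1,
        "in_range": (low + high) // 2,
        "above_range": high + 1,
    }


# Hypothetical specification: a word may be 1 to 30 characters long.
MIN_LEN, MAX_LEN = 1, 30

# Combine both sources of test data and remove duplicates.
test_lengths = sorted(
    set(boundary_values(MIN_LEN, MAX_LEN))
    | set(equivalence_representatives(MIN_LEN, MAX_LEN).values())
)

# Concrete test inputs: one word of each selected length.
test_words = ["w" * n for n in test_lengths]
```

A tester would run the program on each word and compare the observed behavior
with the behavior the specification requires for that length.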

4.1.1.1. Investigation Goals


The goals of this study comprise three different aspects of software testing:
fault detection effectiveness, fault detection cost, and classes of faults
detected. An application of the goal/question/metric paradigm [Basili & Selby
84, Basili & Weiss 84] leads to the framework of goals and questions for this
study appearing in Figure 5.

The first goal area is performance oriented and includes a natural first
question (I.A): which of the techniques detects the most faults in the
programs? The comparison between the techniques is being made across programs,
each with a different number of faults. An alternate interpretation would then
be to compare the percentage of faults found in the programs (question I.A.1).
The number of faults that a technique exposes should also be compared; that
is, faults that are made observable but not necessarily observed and reported
by a tester (I.A.2). Because of the differences in types of software and in
testers' abilities, it is relevant to determine whether the number of faults
detected is either program or programmer dependent (I.B, I.C). Since one
technique may find a few more faults than another, it becomes useful to know
how much effort that technique requires (II.A). Awareness of what types of
software require more effort to test (II.B) and what types of programmer
backgrounds require less effort in fault uncovering (II.C) is also quite
useful. If one is interested in detecting certain classes of faults, such as
in error-based testing [Foster 80, Valdes & Goel 83], it is appropriate to
apply a technique sensitive to that particular type (III.A). Classifying the
types of faults that are observable yet go unreported could help focus and
increase testing effectiveness (III.B).


Figure 5. Structure of goals/subgoals/questions for testing experiment.


I. Fault detection effectiveness

   A. For programmers doing unit testing, which of the testing techniques
      (code reading, functional testing, or structural testing) detects the
      most faults in programs?

      1. Which of the techniques detects the greatest percentage of faults in
         the programs (the programs each contain a different number of
         faults)?

      2. Which of the techniques exposes the greatest number (or percentage)
         of program faults (faults that are observable but not necessarily
         reported)?

   B. Is the number of faults observed dependent on software type?

   C. Is the number of faults observed dependent on the expertise level of
      the person testing?

II. Fault detection cost

   A. For programmers doing unit testing, which of the testing techniques
      (code reading, functional testing, or structural testing) detects the
      faults at the highest rate (#faults/effort)?

   B. Is the fault detection rate dependent on software type?

   C. Is the fault detection rate dependent on the expertise level of the
      person testing?

III. Classes of faults observed

   A. For programmers doing unit testing, do the methods tend to capture
      different classes of faults?

   B. What classes of faults are observable but go unreported?
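
The quantities named in the questions above can be sketched as simple ratios.
The record layout, the field names, and the sample numbers below are
assumptions made for illustration; they are not data or definitions taken
from the study.

```python
from dataclasses import dataclass


@dataclass
class DetectionSession:
    """One tester applying one technique to one program (hypothetical record)."""
    faults_in_program: int   # faults known to be in the program
    faults_observable: int   # faults made observable by the test data
    faults_detected: int     # faults observed and reported by the tester
    effort_hours: float      # total fault detection time

    def percent_detected(self) -> float:
        """Question I.A.1: percentage of the program's faults detected."""
        return 100.0 * self.faults_detected / self.faults_in_program

    def percent_observable(self) -> float:
        """Question I.A.2: percentage of faults exposed by the test data."""
        return 100.0 * self.faults_observable / self.faults_in_program

    def percent_observable_detected(self) -> float:
        """Share of observable faults that the tester actually reported."""
        return 100.0 * self.faults_detected / self.faults_observable

    def detection_rate(self) -> float:
        """Question II.A: faults detected per unit of effort."""
        return self.faults_detected / self.effort_hours


# Hypothetical session on a program that contains nine faults.
session = DetectionSession(faults_in_program=9, faults_observable=6,
                           faults_detected=4, effort_hours=2.0)
```

Comparing percentages rather than raw counts keeps programs with different
numbers of faults on the same scale, which is the point of question I.A.1.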


4.1.2.  Empirical  Study 

Admittedly, the goals stated here are quite ambitious. In no way is it implied
that this study can definitively answer all of these questions for all
environments. It is intended, however, that the statistically significant
analysis presented lends insights into their answers and into the merit and
appropriateness of each of the techniques. Note that this study compares the
individual application of the three testing techniques in order to identify
their distinct advantages and disadvantages. This approach is a first step
toward proposing a composite testing strategy, which possibly incorporates
several testing methods. The following sections describe the empirical study
undertaken to pursue these goals and questions, including the selection of
subjects, programs, and experimental design, and the overall operation of the
study.


4.1.2.1. Iterative Experimentation


The empirical study consisted of three phases. The first and second phases of
the study took place at the University of Maryland in the Falls of 1982 and
1983, respectively. The third phase took place at Computer Sciences
Corporation (CSC - Silver Spring, MD) and NASA Goddard Space Flight Center
(Greenbelt, MD) in the Fall of 1984. The sequential experimentation supported
the iterative nature of the learning process, and enabled the initial set of
goals and questions to be expanded and resolved by further analysis. The goals
were further refined by discussions of the preliminary results [Selby 83,
Selby 84]. These three phases enabled the pursuit of result reproducibility
across environments having subjects with a wide range of experience.

4.1.2.2. Subject and Program/Fault Selection

A primary consideration in this study was to use a realistic testing
environment to assess the effectiveness of these different testing strategies,
as opposed to creating a best possible testing situation [Hetzel 76]. Thus, 1)
the subjects for the study were chosen to be representative of different
levels of expertise, 2) the programs tested correspond to different types of
software and reflect common programming style, and 3) the faults in the
programs were representative of those frequently occurring in software.
Sampling the subjects, programs, and faults in this manner is intended to
evaluate the testing methods reasonably, and to facilitate the generalization
of the results to other environments.


4.1.2.2.1. Subjects


The three phases of the study incorporated a total of 74 subjects; the
individual phases had 29, 13, and 32 subjects, respectively. The subjects were
selected, based on several criteria, to be representative of three different
levels of computer science expertise: advanced, intermediate, and junior. The
number of subjects in each level of expertise for the different phases appears
in Figure 6.


Figure 6. Expertise levels of subjects.

                               Phase
Level of          1        2           3
Expertise         (Univ. Md)       (NASA/CSC)     total

Advanced           0        0           8            8
Intermediate       9        4          11           24
Junior            20        9          13           42

total             29       13          32           74

The 42 subjects in the first two phases of the study were the members of the
upper level "Software Design and Development" course at the University of
Maryland in the Falls of 1982 and 1983. The individuals were either
upper-level computer science majors or graduate students; some were working
part-time and all were in good academic standing. The topics of the course
included structured programming practices, functional correctness, top-down
design, modular specification and design, step-wise refinement, and PDL, in
addition to the presentation of the techniques of code reading, functional
testing, and structural testing. The references for the testing methods were
[Mills 75, Fagan 76, Myers 79, Howden 80], and the lectures were presented by
V. R. Basili and F. T. Baker. The subjects from the University of Maryland
spanned the intermediate and junior levels of computer science expertise. The
assignment of individuals to levels of expertise was based on professional
experience and prior academic performance in relevant computer science
courses. The individuals in the first and second phases had overall averages
of 1.7 (SD = 1.7) and 1.5 (SD = 1.5) years of professional experience. The
nine intermediate subjects in the first phase had from 2.8 to 7 years of
professional experience (average of 3.9 years, SD = 1.3), and the four in the
second phase had from 2.3 to 5.5 years of professional experience (average of
3.2, SD = 1.5). The twenty junior subjects in the first phase and the nine in
the second phase both had from 0 to 2 years professional experience (averages
of 0.7, SD = 0.8, and 0.8, SD = 0.8, respectively).

The 32 subjects in the third phase of the study were programming professionals
from NASA and Computer Sciences Corporation. These individuals were
mathematicians, physicists, and engineers who develop ground support software
for satellites. They were familiar with all three testing techniques, but had
used functional testing primarily. A four hour tutorial on the testing
techniques was conducted for the subjects by R. W. Selby. This group of
subjects, examined in the third phase of the experiment, spanned all three
expertise levels and had an overall average of 10.0 (SD = 5.7) years
professional experience. Several criteria were considered in the assignment of
subjects to expertise levels, including years of professional experience,
degree background, and their manager's suggested assignment. The eight
advanced subjects ranged from 9.5 to 20.5 years professional experience
(average of 15.0, SD = 4.1). The eleven intermediate subjects ranged from 3.5
to 17.5 years experience (average of 10.9, SD = 4.9). The thirteen junior
subjects ranged from 1.5 to 13.5 years experience (average of 6.1, SD = 4.4).
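
The group averages quoted in this section can be cross-checked against the
overall averages for each phase, since an overall mean is the size-weighted
mean of its group means. The sketch below uses only the subject counts and
averages quoted in the text; the function names are illustrative.

```python
def pooled_mean(groups):
    """Size-weighted mean of group means; groups is a list of (count, mean)."""
    total = sum(count for count, _ in groups)
    return sum(count * mean for count, mean in groups) / total


def implied_group_mean(overall_mean, total_count, known_groups, missing_count):
    """Solve the pooled-mean identity for the one group whose mean is unknown."""
    known_sum = sum(count * mean for count, mean in known_groups)
    return (overall_mean * total_count - known_sum) / missing_count


# Phases one and two: intermediate and junior subjects.
phase1 = pooled_mean([(9, 3.9), (20, 0.7)])   # compare with the reported 1.7
phase2 = pooled_mean([(4, 3.2), (9, 0.8)])    # compare with the reported 1.5

# Phase three: junior mean implied by the overall average of 10.0 years
# and the advanced and intermediate group means.
junior_phase3 = implied_group_mean(10.0, 32, [(8, 15.0), (11, 10.9)], 13)
```

Because the published figures are rounded to one decimal place, agreement to
within about 0.1 year is the most that such a check can show.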


4.1.2.2.2. Programs


The experimental design enables the distinction of the testing techniques
while allowing for the effects of the different programs being tested. The
four programs used in the investigation were chosen to be representative of
several different types of software. The programs were selected specially for
the study and were provided to the subjects for testing; the subjects did not
test programs that they had written. All programs were written in a high-level
language with which the subjects were familiar. The three programs tested in
the CSC/NASA phase were written in FORTRAN, and the programs tested in the
University of Maryland phases were written in the Simpl-T structured
programming language [Basili & Turner 76].2 The four programs tested were P1)
a text processor, P2) a mathematical plotting routine, P3) a numeric abstract
data type, and P4) a database maintainer. The programs are summarized in
Figure 7. There exists some differentiation in size, and the programs are a
realistic size for unit testing. Each of the subjects tested three programs,
but a total of four programs was used across the three phases of the study.
The programs tested in each of the three phases of the study appear in
Figure 8. The specifications for the pro-


2 Simpl-T is a structured language that supports several string and file
handling primitives, in addition to the usual control flow constructs
available, for example, in Pascal.


Figure 7. The programs tested.

program                         source   executable   cyclomatic   #routines   #faults
                                lines    statements   complexity

P1 - text formatter              169        55           18            3          9
P2 - mathematical plotting       145        95           32            8          6
P3 - numeric data abstraction    147        48           18            9          7
P4 - database maintainer         365       144           57            7         12

Figure 8. Programs tested in each phase of the analysis.

                                          Phase
Program                           1        2           3
                                  (Univ. Md)       (NASA/CSC)

P1 - text formatter               X        X           X
P2 - mathematical plotting        X        X
P3 - numeric data abstraction     X                    X
P4 - database maintainer                   X           X

The first program is a text formatting program, which also appeared in [Myers
78]. A version of this program, originally written by [Naur 69] using
techniques of program correctness proofs, was analyzed in [Goodenough &
Gerhart 75]. The second program is a mathematical plotting routine. This
program was written by R. W. Selby, based roughly on a sample program in
[Jensen & Wirth 74]. The third program is a numeric data abstraction
consisting of a set of list processing utilities. This program was submitted
for a class project by a member of an intermediate level programming course at
the University of Maryland [McMullin & Gannon 80]. The fourth program is a
maintainer for a database of bibliographic references. This program was
analyzed in [Hetzel 76], and was written by a systems programmer at the
University of North Carolina computation center.

Note that the source code for the programs contains no comments. This creates
a worst-case situation for the code readers. In an environment where code
contained helpful comments, performance of code readers would likely improve,
especially if the source code contained as comments the intermediate functions
of the program segments. In an environment where the comments were at all
suspect, they could then be ignored.

4.1.2.2.3. Faults

The faults contained in the programs tested represent a reasonable
distribution of faults that commonly occur in software [Weiss & Basili 85,
Basili & Perricone 84]. All the faults in the database maintainer and the
numeric abstract data type were made during the actual development of the
programs. The other two programs contain a mix of faults made by the original
programmer and faults seeded in the code. The programs contained a total of 34
faults; the text formatter had nine, the plotting routine had six, the
abstract data type had seven, and the database maintainer had twelve.

4.1.2.2.3.1. Fault Origin

The faults in the text formatter were preserved from the article in which it
appeared [Myers 78], except for some of the more controversial ones [Cailliau
& Rubin 79]. In the mathematical plotter, faults made during program
translation were supplemented by additional representative faults. The faults
in the abstract data type were the original ones made by the program's author
during the development of the program. The faults in the database maintainer
were recorded during the development of the program, and then reinserted into
the program. The next section describes a classification of the different
types of faults in the programs. Note that this investigation of the fault
detecting ability of these techniques involves only those types occurring in
the source code, not other types such as those in the requirements or the
specifications.

4.1.2.2.3.2. Fault Classification

The faults in the programs are classified according to two different abstract
classification schemes [Basili & Perricone 84]. One fault categorization
method separates faults of omission from faults of commission. Faults of
commission are those faults present as a result of an incorrect segment of
existing code. For example, the wrong arithmetic operator is used for a
computation in the right-hand-side of an assignment statement. Faults of
omission are those faults present as a result of a programmer's forgetting to
include some entity in a module. For example, a statement is missing from the
code that would assign the proper value to a variable.

A second fault categorization scheme partitions software faults into the six
classes of 1) initialization, 2) computation, 3) control, 4) interface, 5)
data, and 6) cosmetic. Improperly initializing a data structure constitutes an
initialization fault. For example, assigning a variable the wrong value on
entry to a module. Computation faults are those that cause a calculation to
evaluate the value for a variable incorrectly. The above example of a wrong
arithmetic operator in the right-hand-side of an assignment statement would be
a computation fault. A control fault causes the wrong control flow path in a
program to be taken for some input. An incorrect predicate in an IF-THEN-ELSE
statement would be a control fault. Interface faults result when a module uses
and makes assumptions about entities outside the module's local environment.
Interface faults would be, for example, passing an incorrect argument to a
procedure, or assuming in a module that an array passed as an argument was
filled with blanks by the passing routine. Data faults are those that result
from the incorrect use of a data structure. For example, incorrectly
determining the index for the last element in an array. Finally, cosmetic
faults are clerical mistakes when entering the program. A spelling mistake in
an error message would be a cosmetic fault.

Interpreting and classifying faults in software is a difficult and inexact
task. The categorization process often requires trying to recreate the
original programmer's misunderstanding of the problem [Johnson, Draper &
Soloway 83]. The above two fault classification schemes attempt to distinguish
among different reasons that programmers make faults in software development.
They were applied to the faults in the programs with a consistent
interpretation; it is certainly possible that another analyst could have
interpreted them differently. The separate application of each of the two
classification schemes to the faults categorized them in a mutually exclusive
and exhaustive manner. Figure 9 displays the distribution of faults in the
programs according to these schemes.




Figure 9. Distribution of faults in the programs.

                  Omission   Commission   Total

Initialization        0           2          2
Computation           4           4          8
Control               2           5          7
Interface             2          11         13
Data                  2           1          3
Cosmetic              0           1          1

Total                10          24         34
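
Because each of the two classification schemes is mutually exclusive and
exhaustive, the cross-tabulation in Figure 9 must sum to the same total along
either scheme. The following sketch encodes the two schemes as enumerations
and checks the margins; the type and variable names are illustrative, and the
counts are those of Figure 9.

```python
from enum import Enum


class Mode(Enum):
    """First scheme: how the fault entered the code."""
    OMISSION = "omission"
    COMMISSION = "commission"


class FaultClass(Enum):
    """Second scheme: the six fault classes."""
    INITIALIZATION = "initialization"
    COMPUTATION = "computation"
    CONTROL = "control"
    INTERFACE = "interface"
    DATA = "data"
    COSMETIC = "cosmetic"


# Counts from Figure 9, keyed by (fault class, mode).
FIGURE_9 = {
    (FaultClass.INITIALIZATION, Mode.OMISSION): 0,
    (FaultClass.INITIALIZATION, Mode.COMMISSION): 2,
    (FaultClass.COMPUTATION, Mode.OMISSION): 4,
    (FaultClass.COMPUTATION, Mode.COMMISSION): 4,
    (FaultClass.CONTROL, Mode.OMISSION): 2,
    (FaultClass.CONTROL, Mode.COMMISSION): 5,
    (FaultClass.INTERFACE, Mode.OMISSION): 2,
    (FaultClass.INTERFACE, Mode.COMMISSION): 11,
    (FaultClass.DATA, Mode.OMISSION): 2,
    (FaultClass.DATA, Mode.COMMISSION): 1,
    (FaultClass.COSMETIC, Mode.OMISSION): 0,
    (FaultClass.COSMETIC, Mode.COMMISSION): 1,
}

# Marginal totals along each scheme.
by_mode = {m: sum(n for (_, mode), n in FIGURE_9.items() if mode is m)
           for m in Mode}
by_class = {c: sum(n for (cls, _), n in FIGURE_9.items() if cls is c)
            for c in FaultClass}
total_faults = sum(FIGURE_9.values())
```

Keeping the two schemes as separate enumerations mirrors the text: every fault
receives exactly one value from each scheme.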

4.1.2.2.3.3. Fault Description

The faults in the programs are described in Figure 10. There have been various
efforts to determine a precise counting scheme for "defects" in software
[Gloss-Soler 79, IEEE 83]. According to the explanations given, a software
"fault" is a specific manifestation in the source code of a programmer
"error." For example, due to a misconception or document discrepancy, a
programmer commits an "error" (in his/her head) that may result in more than
one "fault" in a program. Using this interpretation, software "faults" reflect
the correctness, or lack thereof, in a program. The entities examined in this
analysis are software faults.


Figure 10. Fault classification and manifestation.

Fault  Program  Omission/    Class           Description
                Commission

a      P1       omission     control         a blank is printed before the first word
                                             on the first line unless the first word is
                                             30 characters long; in the latter case, a
                                             blank line is printed before the first
                                             word
b      P1       commission   initialization  the character & (not $) is the new-line
                                             character
c      P1       commission   initialization  the line size is 31 characters (not 30);
                                             this fault causes the references to the
                                             number 30 in the other faults to be
                                             actually the number 31
d      P1       commission   interface       since the program pads an empty input
                                             buffer with the character "z," it ignores
                                             a valid input line that has a "z" as a
                                             first character
e      P1       omission     control         successive break characters are not
                                             condensed in the output
f      P1       commission   cosmetic        spelling mistake in the error message
                                             "*** word to long ***"
g      P1       commission   computation     after detecting a word in the input
                                             longer than 30 characters, the message
                                             "*** word to long ***" is printed once
                                             for every character over 30, and the
                                             processing of the text does not terminate
h      P1       omission     interface       after detecting a word in the input
                                             longer than 30 characters, the program
                                             prints whatever is residing in its output
                                             buffer
i      P1       commission   control         after detecting an input line without an
                                             end-of-text character, the program
                                             erroneously increments its buffer pointer
                                             and replaces the first character of the
                                             next input line with a "z"
j      P3       commission   interface       routine FIRST returns zero (0) when the
                                             list has one element
k      P3       commission   interface       routine ISEMPTY returns true (1) when
                                             the list has one element
l      P3       commission   interface       routine DELETEFIRST can not delete the
                                             first list element when the list has only
                                             one element
m      P3       commission   interface       routine LISTLENGTH returns one less than
                                             the actual length of the list
n      P3       commission   interface       routine ADDFIRST can add more than the
                                             specified five elements to the list
o      P3       commission   interface       routine ADDLAST can add more than the
                                             specified five elements to the list
p      P3       omission     computation     routine REVERSE does not reverse the list
                                             properly when the list has more than one
                                             element
q      P4       commission   computation     words greater than or equal to three
                                             characters (not strictly greater than) are
                                             treated as cross reference keywords
r      P4       commission   interface       since the program uses the key "ZZZ" as
                                             an end-of-input sentinel, it does not
                                             process a valid record with key "ZZZ" and
                                             ignores any following records
s      P4       commission   control         update action add with the error
                                             condition "key already in the master file"
                                             replaces the existing record; the update
                                             record is not ignored
t      P4       commission   control         update action replace with the error
                                             condition "key not found in the master
                                             file" adds the record; the update record
                                             is not ignored
u      P4       omission     data            the number of references and number of
                                             words in the dictionary are not checked
                                             for overflow
v      P4       omission     computation     two or more update transactions for the
                                             same master record give incorrect results
w      P4       commission   interface       keywords longer than 12 characters are
                                             truncated and not distinguished
x      P4       commission   control         an update record with column 80 neither
                                             an add action "A" nor replace action "R"
                                             acts like an add transaction
y      P4       commission   interface       keyword indices appear in reverse
                                             alphabetical order
z      P4       omission     interface       no check is made for unique keys in the
                                             master file
A      P4       commission   interface       punctuation is made a part of the keyword
B      P4       omission     data            words appearing twice in a title get two
                                             cross reference entries
C      P2       commission   computation     the x and y axes are mislabeled
D      P2       [entry illegible]
E      P2       commission   control         the origin (0,0) appears on the graph
                                             regardless of whether it is an input point
F      P2       commission   data            no points can appear on the vertical axis
G      P2       commission   computation     the vertical and horizontal scaling for
                                             the pixels are calculated incorrectly,
                                             causing some points not to appear in the
                                             proper pixel
H      P2       omission     computation     when more than one point would appear in
                                             a given pixel, only an asterisk (*)
                                             appears, not an appropriate integer

4.1.2.3. Experimental Design

The experimental design applied for each of the three phases of the study was
a fractional factorial design [Cochran & Cox 50, Box, Hunter, & Hunter 78].
This experimental design distinguishes among the testing techniques, while
allowing for variation in the ability of the particular individual testing or
in the program being tested. Figure 11 displays the fractional factorial
design appropriate for the third phase of the study. Subject S1 is in the
advanced expertise level, and he structurally tested program P1, functionally
tested program P3, and code read program P4. Notice that all of the subjects
tested each of the three programs and used each of the three techniques. Of
course, no one tests a given program more than once. The design appropriate
for the third phase is discussed in the following paragraphs, with the minor
differences between this design and the ones applied in the first two phases
being discussed at the end of the section.


Figure 11. Fractional factorial design.

                                 Code          Functional    Structural
                                 Reading       Testing       Testing
                                 P1  P3  P4    P1  P3  P4    P1  P3  P4

Advanced subjects      S1 ... S8
Intermediate subjects  S9 ... S19
Junior subjects        S20 ... S32

[cell markings for the individual subjects are illegible]
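
The constraint stated for the design, that every subject uses each technique
once and tests each program once, can be sketched as follows. The cyclic
rotation through the six possible technique-to-program pairings is an
assumption made for illustration; it is not the assignment actually used in
the study, which cannot be read from Figure 11. The subject counts per level
are those of the third phase.

```python
from itertools import permutations

TECHNIQUES = ("code reading", "functional testing", "structural testing")
PROGRAMS = ("P1", "P3", "P4")
LEVELS = {"advanced": 8, "intermediate": 11, "junior": 13}


def assign_design(levels=LEVELS):
    """Give each subject one program per technique, rotating the pairings."""
    # All 3! = 6 ways to pair the three techniques with the three programs.
    pairings = list(permutations(PROGRAMS))
    design = {}
    subject_number = 1
    for level, count in levels.items():
        for i in range(count):
            # Hypothetical balancing rule: cycle through pairings per level.
            programs = pairings[i % len(pairings)]
            design[f"S{subject_number}"] = {
                "level": level,
                "plan": dict(zip(TECHNIQUES, programs)),
            }
            subject_number += 1
    return design


design = assign_design()
```

Each subject contributes three of the nine technique-program combinations,
which is why the design is a fraction of the full factorial.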


4. 1.2. 3.1.  Independent  and  Dependent  Variables 

The  experimental  design  has  the  three  Independent  variables  of  testing 
technique,  software  type,  and  level  of  expertise.  For  the  design  appearing  In 
Figure  11,  appropriate  for  the  third  phase  of  the  study,  the  three  main  effects 
have  the  following  levels: 

1)  testing  technique:  code  reading,  functional  testing,  and  structural  testing 

2)  software  type:  ( P :)  text  processing,  (P  3 )  numeric  abstract  data  type,  and 

(P40  database  malntalner 

3)  level  of  expertise:  advanced,  Intermediate,  and  Junior 

Every  combination  of  these  levels  occurs  In  the  design.  That  Is.  programmers  In 


all  three  levels  of  expertise  applied  all  three  testing  techniques  on  all  programs. 


In addition to these three main effects, a factorial analysis of variance
(ANOVA) model supports the analysis of interactions among each of these main
effects. Thus, the interaction effects of testing technique * software type,
testing technique * expertise level, software type * expertise level, and the
three-way interaction of testing technique * software type * expertise level
are included in the model. There are several dependent variables examined in
the study, including number of faults detected, percentage of faults
detected, total fault detection time, and fault detection rate. Observations
from the on-line methods of functional and structural testing also had as
dependent variables number of computer runs, amount of cpu-time consumed,
maximum statement coverage achieved, connect time used, number of faults that
were observable from the test data, percentage of faults that were observable
from the test data, and percentage of faults observable from the test data
that were actually observed by the tester.
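
The structure of the design can be made concrete with a short sketch. The
Python below is only an illustration, not part of the study's tooling: it
enumerates the cells formed by the three factors and lists the main effects
and interactions named above. The dictionary keys and function names are
assumptions introduced for the example.

```python
from itertools import combinations, product

# Factor levels as listed in the text for the phase-three design.
FACTORS = {
    "technique": ["code reading", "functional testing", "structural testing"],
    "program": ["P1", "P3", "P4"],
    "expertise": ["advanced", "intermediate", "junior"],
}


def design_cells(factors):
    """Return every combination of factor levels (one tuple per cell)."""
    return list(product(*factors.values()))


def fixed_effect_terms(factors):
    """Return the main effects plus all two-way and three-way interactions."""
    names = list(factors)
    terms = []
    for size in range(1, len(names) + 1):
        terms.extend(" * ".join(combo) for combo in combinations(names, size))
    return terms


cells = design_cells(FACTORS)
terms = fixed_effect_terms(FACTORS)
print(len(cells), "cells")
for term in terms:
    print(term)
```

Three factors at three levels each give 27 cells and seven fixed-effect
terms: three main effects, three two-way interactions, and one three-way
interaction.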

4.1.2.3.2.  Analysis of Variance Model

The three main effects and all the two-way and three-way interaction effects
are called fixed effects in this factorial analysis of variance model. The
levels of these effects given above represent all levels of interest in the
investigation. For example, the effect of testing technique has as particular
levels code reading, functional testing, and structural testing; these
particular testing techniques are the only ones under comparison in this
study. The effect of the particular subjects that participated in this study
requires a slightly different interpretation. The subjects examined in the
study were random samples of programmers from the large population of
programmers at each of the levels of expertise. Thus, the effect of the
subjects on the various dependent variables is a random variable, and this
effect therefore is called a random effect. If the samples examined are truly
representative of the population of subjects at each expertise level, the
inferences from the analysis can then be generalized across the whole
population of subjects at each expertise level, not just across the
particular subjects in the sample chosen. Since this analysis of variance
model contains both fixed and random effects, it is called a mixed model. The
actual ANOVA model for the design appearing in Figure 11 is given below.


\[
Y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_{kl}
         + \alpha\beta_{ij} + \alpha\gamma_{ik} + \beta\gamma_{jk}
         + \alpha\beta\gamma_{ijk} + \epsilon_{ijkl}
\]

where

$Y_{ijkl}$ is the observed response from subject l of experience level k
using testing technique i on program j

$\mu$ is the overall mean response

$\alpha_i$ is the main effect of testing technique i (i = 1, 2, 3)

$\beta_j$ is the main effect of program j (j = 1, 3, 4)

$\gamma_k$ is the main effect of expertise level k (k = 1, 2, 3)

$\delta_{kl}$ is the random effect of subject l within expertise level k, a
random variable (l = 1, 2, ..., 32; k = 1, 2, 3)

$\alpha\beta_{ij}$ is the interaction effect of testing technique i with
program j (i = 1, 2, 3; j = 1, 3, 4)

$\alpha\gamma_{ik}$ is the interaction effect of testing technique i with
expertise level k (i = 1, 2, 3; k = 1, 2, 3)

$\beta\gamma_{jk}$ is the interaction effect of program j with expertise
level k (j = 1, 3, 4; k = 1, 2, 3)

$\alpha\beta\gamma_{ijk}$ is the interaction effect of testing technique i
with program j with experience level k (i = 1, 2, 3; j = 1, 3, 4; k = 1, 2, 3)

$\epsilon_{ijkl}$ is the experimental error for each observation, a random
variable

The F tests of hypotheses on all the fixed effects mentioned above use the
error (residual) mean square in the denominator, except for the test of the
expertise level effect. The expected mean square for the expertise level
effect contains a component for the actual variance of subjects within
expertise level. In order to select the appropriate error term for the
denominator of the expertise level F test, the mean square for the effect of
subjects nested within expertise level is chosen. The parameters for the
random effect of subjects within expertise level are assumed to be drawn from
a normally distributed random process with mean zero and common variance. The
experimental error terms are assumed to have mean zero and common variance.
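
The choice of denominator described above can be illustrated with a few lines
of Python. This is a sketch only: the mean-square values are invented for the
example, and the term labels (such as "subject(expertise)") are hypothetical
names, not output from the study's statistical package.

```python
# Hypothetical mean squares; the values are made up for illustration only.
mean_squares = {
    "technique": 24.0,
    "program": 18.0,
    "expertise": 30.0,
    "subject(expertise)": 6.0,  # subjects nested within expertise level
    "error": 2.0,               # residual mean square
}


def f_ratio(effect, ms):
    """F statistic for a fixed effect, choosing the denominator as described.

    The expertise effect is tested against subjects nested within expertise
    level; every other fixed effect is tested against the residual.
    """
    denominator = "subject(expertise)" if effect == "expertise" else "error"
    return ms[effect] / ms[denominator]


for effect in ("technique", "program", "expertise"):
    print(effect, f_ratio(effect, mean_squares))
```

With these invented numbers, the expertise effect yields a much smaller F
ratio against the nested subject term than it would against the residual,
which is the reason the error term must be chosen with care.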

The fractional factorial design applied in the first two phases of the
analysis differed slightly from the one presented above for the third
phase.3 In the third phase of the study, programs P1, P3, and P4 were tested
by subjects in three levels of expertise. In both phases one and two, there
were only subjects from the levels of intermediate and junior expertise. In
phase one, programs P1, P2, and P3 were tested. In phase two, the programs
tested were P1, P2, and P4. The only modifications necessary to the above
explanation for phases one and two are 1) eliminating the advanced expertise
level, 2) changing the program P subscripts appropriately, and 3) leaving out
the three-way interaction term in phase two, because of the reduced number of
subjects. In all three of the phases, all subjects used each of the three
techniques and tested each of the three programs for that phase. Also, within
all three phases, all possible combinations of expertise level, testing
techniques, and programs occurred.

3  Although the data from all the phases can be analyzed together, the
number of empty cells resulting from not having all three experience levels
and all four programs in all phases limits the number of parameters that can
be estimated and causes non-unique Type IV partial sums of squares.

The order of presentation of the testing techniques was randomized among the
subjects in each level of expertise in each phase of the study. However, the
integrity of the results would have suffered if each of the programs in a
given phase had been tested at different times by different subjects. Note
that each of the testing sessions took place on a different day because of
the amount of effort required. If different subjects had tested the same
program on different days, any discussion about the programs among subjects
between testing sessions would have affected the later performance of others.
Therefore, all subjects in a phase tested the same program on the same day.
The actual order of program presentation was the order in which the programs
are listed in the previous paragraph.
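
A simplified sketch of this scheduling rule follows. It is an illustration
under stated assumptions, not the procedure used in the study: each subject
simply receives an independent random ordering of the techniques, so the
sketch does not enforce the balance across cells that the fractional
factorial design requires. The function name, subject labels, and seed are
hypothetical.

```python
import random

TECHNIQUES = ["reading", "functional", "structural"]


def make_schedule(subjects, program_order, seed=None):
    """Map each subject to a list of (day, program, technique) sessions.

    Programs follow one fixed order for everyone, so all subjects test the
    same program on the same day; only the technique order is shuffled,
    independently for each subject.
    """
    rng = random.Random(seed)
    schedule = {}
    for subject in subjects:
        order = TECHNIQUES[:]
        rng.shuffle(order)
        pairs = zip(program_order, order)
        schedule[subject] = [
            (day, program, technique)
            for day, (program, technique) in enumerate(pairs, start=1)
        ]
    return schedule


schedule = make_schedule(["S1", "S2", "S3", "S4"], ["P1", "P3", "P4"], seed=1985)
for subject, sessions in schedule.items():
    print(subject, sessions)
```

Every subject in the printed schedule meets the programs in the same order,
one per day, while the technique applied on a given day varies from subject
to subject.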

4.1.2.4.  Experimental Operation

Each of the three phases was broken into five distinct pieces: training,
three testing sessions, and a follow-up session. All groups of subjects were
exposed to a similar amount of training on the testing techniques before the
study began. As mentioned earlier, the University of Maryland subjects were
enrolled in the "Software Design and Development" course, and the NASA/CSC
subjects were given a four-hour tutorial. Background information on the
subjects was captured through a questionnaire. Elementary exercises followed
by a pretest covering all techniques were administered to all subjects after
the training and before the testing sessions. Reasonable effort on the part
of the University of Maryland subjects was enforced by their being graded on
the work and by their needing to use the techniques in a major class project.
Reasonable effort on the part of the NASA/CSC subjects was certain because of
their desire for the study's outcome to improve their software testing
environment. All subject groups were judged highly motivated during the
study. The subjects were all familiar with the editors, terminals, machines,
and the programs' implementation language.

The individuals were requested to use the three testing techniques to the
best of their ability. Every subject participated in all three testing
sessions of his/her phase, using all techniques but each on a separate
program. The individuals using code reading were each given the specification
for the program and its source code. They were then asked to apply the
methods of code reading by stepwise abstraction to detect discrepancies
between the program's abstracted function and the specification. The
functional testers were each given a specification and the ability to execute
the program. They were asked to perform equivalence partitioning and boundary
value analysis to select a set of test data for the program. Then they
executed the program on this collection of test data, and inconsistencies
between what the program actually performed and what they thought the
specification said it should perform were noted. The structural testers were
given the source code for the program, the ability to execute it, and a
description of the input format for the program. The structural testers were
asked to examine the source and generate a set of test cases that
cumulatively execute 100% of the program's statements. When the subjects were
applying an on-line technique, they generated and executed their own test
data; no test data sets were provided. The programs were invoked through a
test driver that supported the use of multiple input data sets. This test
driver, unbeknown to the subjects, drained off the input cases submitted to
the program for the experimenter's later analysis; the programs could only be
accessed through a test driver.
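
The idea of a driver that records every submitted input case can be sketched
as follows. This is a minimal illustration in Python, not the driver used in
the study; the wrapper, the logged list, and the word-counting program under
test are all hypothetical.

```python
def make_test_driver(program, log):
    """Wrap `program` so that every submitted input case is recorded in `log`."""
    def driver(input_sets):
        outputs = []
        for case in input_sets:      # the driver accepts multiple input data sets
            log.append(case)         # keep a copy for later analysis
            outputs.append(program(case))
        return outputs
    return driver


# Hypothetical program under test: counts the words in a line of text.
def word_count(line):
    return len(line.split())


captured_inputs = []
run = make_test_driver(word_count, captured_inputs)
print(run(["one two", "", "a b c"]))
print(captured_inputs)
```

Because the program is reachable only through the wrapper, the log holds the
complete set of inputs a tester tried, which is what later allows the
observable faults to be determined from the test data.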

A structural coverage tool calculated the actual statement coverage of the
test set and which statements were left unexecuted for the structural
testers. After the structural testers generated a collection of test data
that met (or almost met) the 100% coverage criterion, no further execution of
the program or reference to the source code was allowed. They retained the
program's output from the test cases they had generated. These testers were
then provided with the program's specification. Now that they knew what the
program was intended to do, they were asked to contrast the program's
specification with the behavior of the program on the test data they derived.
This scenario for the structural testers was necessary so that "observed"
faults could be compared.
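
The calculation performed by such a coverage tool can be sketched in a few
lines. The Python below is an assumption-laden illustration rather than the
tool used in the study: statements are represented by integer labels, and
each test case is represented by the set of statements it executed.

```python
def cumulative_coverage(all_statements, executed_per_test):
    """Return (percent covered, unexecuted statements) for a whole test set."""
    universe = set(all_statements)
    executed = set()
    for statements in executed_per_test:
        executed |= set(statements)
    executed &= universe                 # ignore anything outside the program
    unexecuted = sorted(universe - executed)
    percent = 100.0 * len(executed) / len(universe)
    return percent, unexecuted


# Hypothetical program with 10 statements and three test cases.
statements = range(1, 11)
runs = [{1, 2, 3, 4}, {1, 2, 5, 6, 7}, {1, 8, 9}]
percent, missing = cumulative_coverage(statements, runs)
print(percent, missing)
```

Coverage is cumulative over the test set, so a statement counts as covered if
any one test case executes it; the list of unexecuted statements is what a
tester would use to design the next test case.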

At the end of each of the testing sessions, the subjects were asked to give a
reasonable estimate of the amount of time spent detecting faults with a given
testing technique. The University of Maryland subjects were assured that this
had nothing to do with the grading of the work. There seemed to be little
incentive for the subjects in any of the groups not to be truthful. At the
completion of each testing session, the NASA/CSC subjects were also asked
what percentage of the faults in the program they thought were uncovered.
After all three testing sessions in a given phase were completed, the
subjects were requested to critique and evaluate the three testing techniques
regarding their understandability, naturalness, and effectiveness. The
University of Maryland subjects submitted a written critique, while a
two-hour debriefing forum was conducted for the NASA/CSC individuals. In
addition to obtaining the impressions of the individuals, these follow-up
procedures gave an understanding of how well the subjects were comprehending
and applying the methods. These final sessions also afforded the participants
an opportunity to comment on any particular problems they had with the
techniques or in applying them to the given programs.

4.1.3.  Data Analysis

The analysis of the data collected from the various phases of the experiment
is presented according to the goal and question framework discussed earlier.

4.1.3.1.  Fault Detection Effectiveness

The first goal area addresses the fault detection effectiveness of each of
the techniques. Figure 12 presents a summary of the measures that were
examined to pursue this goal area. A brief description of each measure is as
follows; (*) means only relevant for on-line testing.

a)  # Faults detected - the number of faults detected by a subject applying a
    given testing technique on a given program.

b)  % Faults detected - the percentage of a program's faults that a subject
    detected by applying a testing technique to the program.

c)  # Faults observable (*) - the number of faults that were observable from
    the program's behavior given the input data submitted.

d)  % Faults observable (*) - the percentage of a program's faults that were
    observable from the program's behavior given the input data submitted.

e)  % Detected/observable (*) - the percentage of faults observable from the
    program's behavior on the given input set that were actually observed by
    a subject.

f)  % Faults felt found - a subject's estimate of the percentage of a
    program's faults that he/she thought were detected by his/her testing.

g)  Maximum statement coverage (*) - the maximum percentage of a program's
    statements that were executed in a set of test cases.
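
The relationships among the count and percentage measures can be illustrated
with a small sketch. The Python below is a hedged example, not the analysis
code of the study: faults are represented by integer labels, the function and
key names are hypothetical, and the session data are invented.

```python
def effectiveness_measures(program_faults, observable, detected):
    """Compute the count and percentage measures for one on-line session.

    program_faults: all faults in the program
    observable:     faults revealed by the program's behavior on the input
    detected:       faults the subject actually reported
    """
    program_faults = set(program_faults)
    observable = set(observable)
    detected = set(detected)

    def pct(part, whole):
        return 100.0 * len(part) / len(whole) if whole else 0.0

    return {
        "# faults detected": len(detected),
        "% faults detected": pct(detected, program_faults),
        "# faults observable": len(observable),
        "% faults observable": pct(observable, program_faults),
        "% detected/observable": pct(detected & observable, observable),
    }


# Hypothetical session: 8 faults in the program, 6 made observable, 3 reported.
measures = effectiveness_measures(range(1, 9), {1, 2, 3, 4, 5, 6}, {1, 2, 3})
for name, value in measures.items():
    print(name, value)
```

The last measure uses the observable faults, not all of the program's faults,
as its denominator, which separates the quality of the test data from the
tester's success in examining the output.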

4.1.3.1.1.  Data Distributions

The actual distribution of the number of faults observed by the subjects
appears in Figure 13, broken down by phase. From Figures 12 and 13, the large
variation in performance among the subjects is clearly seen. The mean number
of faults detected by the subjects is displayed in Figure 14, broken down by
technique, program, expertise level, and phase.


Figure 12.  Overall summary of detection effectiveness data.
Note: some data pertain to only on-line techniques (*), and some data were
collected only in certain phases.

Phase  Subjects  Measure                   Mean      SD    Min.     Max.
-----  --------  ----------------------  ------  ------  ------  -------
1      29        # Faults detected         3.94    1.82    0.00     7.00
1      29        % Faults detected        54.78   25.11    0.00   100.00
1      29(*)     # Faults observable       5.38    1.51    3.00     8.00
1      29(*)     % Faults observable      74.59   20.54   33.33   100.00
1      29(*)     % Detected/observable    70.99   24.01    0.00   100.00
2      13        # Faults detected         3.28    1.96    0.00     7.00
2      13        % Faults detected        39.53   27.25    0.00   100.00
3      32        # Faults detected         4.27    1.86    0.00     8.00
3      32        % Faults detected        49.82   27.44    0.00   100.00
3      32        % Faults felt found      75.10   24.07    0.00   100.00
3      32(*)     # Faults observable       5.61    1.52    3.00     9.00
3      32(*)     % Faults observable      62.11   18.38   25.00   100.00
3      32(*)     % Detected/observable    69.67   27.14    0.00   100.00
3      32(*)     Max. % stmt. covered     97.02    7.83   46.00   100.00
Ave    74        # Faults detected         3.97    1.88    0.00     8.00
Ave    74        % Faults detected        49.96   27.29    0.00   100.00
Ave    61(*)     # Faults observable       5.5     1.5     3.00     9.00
Ave    61(*)     % Faults observable      68.0    20.3    25.0    100.0
Ave    61(*)     % Detected/observable    70.3    25.6     0.0    100.0

Figure 14.  Overall summary for number of faults detected.

                          Phase 1       Phase 2       Phase 3
Effect      Level         Mean (SD)     Mean (SD)     Mean (SD)
---------   ----------    -----------   -----------   -----------
Technique   Reading       [illegible]   [illegible]   [illegible]
            Functional    4.45 (1.70)   3.77 (1.83)   4.47 (1.34)
            Structural    3.28 (1.87)   3.08 (1.89)   3.25 (1.80)
Program     Formatter     4.07 (1.82)   3.23 (2.20)   4.19 (1.73)
            Plotter       3.48 (1.45)   3.31 (1.97)   .    (.)
            Data type     [illegible]   .    (.)      5.22 (1.75)
            Database      .    (.)      3.31 (1.84)   3.41 (1.68)
Expertise   Junior        3.88 (1.89)   3.04 (2.07)   [illegible]
            Intermed.     4.07 (1.69)   3.83 (1.64)   [illegible]
            Advanced      .    (.)      .    (.)      5.00 (1.53)

4.1.3.1.2.  Number of Faults Detected

The first question under this goal area asks which of the testing techniques
detected the most faults in the programs. The overall F-test of the
techniques detecting an equal number of faults in the programs is rejected in
the first and third phases of the study (α < .024 and α < .0001,
respectively; not rejected in phase two, α > .05). Recall that the phase
three data was collected from 32 NASA/CSC subjects, and the phase one data
was from 29 University of Maryland subjects. With the phase three data, the
contrast of "reading - 0.5 * (functional + structural)" estimates that the
technique of code reading by stepwise abstraction detected 1.24 more faults
per program than did either of the other techniques (α < .0001, c.i. 0.73 -
1.75).4 Note that code reading performed well even though the professional
subjects' primary experience was with functional testing. Also with the phase
three data, the contrast of "functional - structural" estimates that the
technique of functional testing detected 1.11 more faults per program than
did structural testing (α < .0007, c.i. 0.52 - 1.70). In the phase one data,
the contrast of "0.5 * (reading + functional) - structural" estimates that
the technique of structural testing detected 1.00 fault less per program than
did either reading or functional testing (α < .0065, c.i. 0.31 - 1.69). In
the phase one data, the contrast of "reading - functional" was not
statistically different from zero (α > .05). The poor performance of
structural testing across the phases suggests the inadequacy of using
statement coverage criteria. The above pairs of contrasts were chosen because
they are linearly independent.
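
How such a contrast and its interval are formed can be sketched briefly. The
Python below is an illustration only: the technique means, standard error,
and critical value are invented, the function names are hypothetical, and the
orthogonality check assumes equal numbers of observations per technique.
Orthogonal contrasts are in particular linearly independent.

```python
def contrast_estimate(means, weights):
    """Weighted sum of technique means; the weights of a contrast sum to zero."""
    if abs(sum(weights.values())) > 1e-12:
        raise ValueError("contrast weights must sum to zero")
    return sum(weights[name] * means[name] for name in weights)


def confidence_interval(estimate, std_error, critical_value):
    """Symmetric interval: estimate +/- critical value * standard error."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width


def are_orthogonal(w1, w2):
    """With equal cell sizes, two contrasts are orthogonal when this sum is zero."""
    return abs(sum(w1[name] * w2[name] for name in w1)) < 1e-12


READING_VS_REST = {"reading": 1.0, "functional": -0.5, "structural": -0.5}
FUNCTIONAL_VS_STRUCTURAL = {"reading": 0.0, "functional": 1.0, "structural": -1.0}

# Hypothetical technique means (faults detected per program), illustration only.
means = {"reading": 5.0, "functional": 4.5, "structural": 3.5}
estimate = contrast_estimate(means, READING_VS_REST)
interval = confidence_interval(estimate, std_error=0.25, critical_value=2.0)
print(estimate, interval)
print(are_orthogonal(READING_VS_REST, FUNCTIONAL_VS_STRUCTURAL))
```

Because the interval is symmetric about the estimate, a reported estimate and
one endpoint determine the other endpoint.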


4.1.3.1.3.  Percentage of Faults Detected

Since the programs tested each had a different number of faults, a question
in the earlier goal/question framework asks which technique detected the
greatest percentage of faults in the programs. The order of performance of
the techniques is the same as above when the percentages of the programs'
faults detected are compared. The overall F-tests for phases one and three
were rejected as before (α < .037 and α < .0001, respectively; not rejected
in phase two, α > .05). Applying the same contrasts as above: a) in phase
three, reading detected 16.0% more faults per program than did the other
techniques (α < .0001, c.i. 9.9 - 22.1), and functional detected 11.2% more
faults than did structural (α < .003, c.i. 4.1 - 18.3); b) in phase one,
structural detected 13.2% fewer of a program's faults than did the other
methods (α < .011, c.i. 3.5 - 22.9), and reading and functional were not
statistically different as before.

4  The probability of Type I error is reported, the probability of
erroneously rejecting the null hypothesis. The abbreviation "c.i." stands for
95% confidence interval.

4.1.3.1.4.  Dependence on Software Type

Another question in this goal area queries whether the number or percentage
of faults detected depends on the program being tested. The overall F-test
that the number of faults detected is not program dependent is rejected only
in the phase three data (α < .0001). Applying Tukey's multiple comparison on
the phase three data reveals that the most faults were detected in the
abstract data type, the second most in the text formatter, and the least
number of faults were found in the database maintainer (simultaneous
α < .05). When the percentage of faults found in a program is considered,
however, the overall F-tests for the three phases are all rejected (α < .027,
α < .01, and α < .0001 in respective order). Tukey's multiple comparison
yields the following orderings on the programs (all simultaneous α < .05). In
the phase one data, the ordering was (data type ≈ plotter) > text formatter;
that is, a higher percentage of faults were detected in either the abstract
data type or the plotter than were found in the text formatter; there was no
difference between the abstract data type and the plotter in the percentage
found. In the phase two data, the ordering of percentage of faults detected
was plotter > (text formatter ≈ database maintainer). In the phase three
data, the ordering of percentage of faults found in the programs was the same
as the number of faults found, abstract data type > text formatter > database
maintainer. Summarizing the effect of the type of software on the percentage
of faults observed: 1) the programs with the highest percentage of their
faults detected were the abstract data type and the mathematical plotter; the
percentage detected between these two was not statistically different; 2) the
programs with the lowest percentage of their faults detected were the text
formatter and the database maintainer; the percentage detected between these
two was not statistically different in the phase two data, but a higher
percentage of faults in the text formatter was detected in the phase three
data.

4.1.3.1.5.  Observable vs. Observed Faults

One evaluation criterion of the success of a software testing session is the
number of faults detected. An evaluation criterion of the particular test
data generated, however, is the ability of the test data to reveal faults in
the program. A test data set's ability to uncover faults in a program can be
measured by the number or percentage of a program's faults that are made
observable from execution on that input. Distinguishing the faults observable
in a program from the faults actually observed by a tester highlights the
differences in the activities of test data generation and program behavior
examination. As shown in Figure 12, the average percentage of the programs'
faults observable was 68.0% when individuals were either functionally testing
or structurally testing. Of course, with a nonexecution-based technique such
as code reading, 100% of the faults are observable. Test data generated by
subjects using the technique of functional testing resulted in 1.4 more
observable faults (α < .0002, c.i. 0.79 - 2.01) than did the use of
structural testing in phase one of the study; the percentage difference of
functional over structural was estimated at 20.0% (α < .0002, c.i. 11.2 -
28.8). The techniques did not differ in these two measures in the third phase
of the study. However, just considering the faults that were observable from
the submitted test data, functional testers detected 18.5% more of these
observable faults than did structural testers in the phase three data
(α < .0016, c.i. 8.9 - 28.1); they did not differ in the phase one data. Note
that all faults in the programs could be observed in the programs' output
given the proper input data. When using the on-line techniques of functional
and structural testing, subjects detected 70.3% of the faults observable in
the program's output. In order to conduct a successful testing session,
faults in a program must be both revealed and subsequently observed.

4.1.3.1.6.  Dependence on Program Coverage

Another measure of the ability of a test set to reveal a program's faults is
the percentage of a program's statements that are executed by the test set.
The average maximum statement coverage achieved by the functional and
structural testers was 97.0%. The maximum statement coverage from the
submitted test data was not statistically different between the functional
and structural testers (α > .05). Also, there was no correlation between
maximum statement coverage achieved and either number or percentage of faults
found (α > .05).

4.1.3.1.7.  Dependence on Programmer Expertise

A final question in this goal area concerns the contribution of programmer
expertise to fault detection effectiveness. In the phase three data from the
NASA/CSC professional environment, subjects of advanced expertise detected
more faults than did either the subjects of intermediate or junior expertise
(α < .05). When the percentage of faults detected is compared, however, the
advanced subjects performed better than the junior subjects (α < .05), but
were not statistically different from the intermediate subjects (α > .05).
The intermediate and junior subjects were not statistically different in any
of the three phases of the study in terms of number or percentage of faults
observed. When several subject background attributes were correlated with the
number of faults found, total years of professional experience had a minor
relationship (Pearson R = .22, α < .05). Correspondence of performance with
background aspects was examined across all observations, and within each of
the phases, including previous academic performance for the University of
Maryland subjects. Other than the above, no relationships were found.

4.1.3.1.8.  Accuracy of Self-Estimates

Recall that the NASA/CSC subjects in the phase three data estimated, at the
completion of a testing session, the percentage of a program's faults they
thought they had uncovered. This estimate of the percentage of faults
uncovered correlated reasonably well with the actual percentage of faults
detected (R = .57, α < .0001). Investigating further, individuals using
certain techniques were able to give better estimates: code readers gave the
best estimates (R = .79, α < .0001), structural testers gave the second best
estimates (R = .57, α < .0007), and functional testers gave the worst
estimates (no correlation, α > .05). This last observation suggests that the
code readers were more certain of the effectiveness they had in revealing
faults in the programs.
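
The correlation statistic quoted in this section is the Pearson
product-moment coefficient. A minimal pure-Python sketch of its computation
follows; the function name is hypothetical and the estimated and actual
percentages are invented for illustration, not taken from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: self-estimated vs. actual percentage of faults detected.
estimated = [50.0, 60.0, 75.0, 80.0, 90.0]
actual = [40.0, 55.0, 50.0, 70.0, 85.0]
print(round(pearson_r(estimated, actual), 2))
```

A coefficient near 1 means that subjects who believed they had found more of
the faults had in fact found more, while a coefficient near 0, as reported
for the functional testers, means the self-estimates carried no information
about actual performance.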

4.1.3.1.9.  Dependence on Interactions

There were few significant interactions between the main effects of testing
technique, program, and expertise level. In the phase two data, there was an
interaction between testing technique and program in both the number and
percentage of faults found (α < .0013 and α < .0014, respectively). The
effectiveness of code reading increased on the text formatter. In the phase
three data, there was a slight three-way interaction between testing
technique, program, and expertise level for both the number and percentage of
faults found (α < .05 and α < .04, respectively).

4.1.3.1.10.  Summary of Fault Detection Effectiveness

Summarizing the major results of the comparison of fault detection
effectiveness: 1) in the phase three data, code reading detected a greater
number and percentage of faults than the other methods, with functional
detecting more than structural; 2) in the phase one data, code reading and
functional were equally effective, while structural was inferior to both -
there were no differences among the three techniques in phase two; 3) the
number of faults observed depends on the type of software: the most faults
were detected in the abstract data type and the mathematical plotter, the
second most in the text formatter, and (in the case of the phase three data)
the least were found in the database maintainer; 4) functionally generated
test data revealed more observable faults than did structurally generated
test data in phase one, but not in phase three; 5) subjects of intermediate
and junior expertise were equally effective in detecting faults, while
advanced subjects found a greater number of faults than did either group; and
6) self-estimates of faults detected were most accurate from subjects
applying code reading, followed by those doing structural testing, with
estimates from persons functionally testing having no relationship.

4.1.3.2.  Fault Detection Cost

The second goal area examines the fault detection cost of each of the
techniques. Figure 15 presents a summary of the measures that were examined
to investigate this goal area. A brief description of each measure is as
follows; (*) means only relevant for on-line testing.

a)  # Faults / hour - the number of faults detected by a subject applying a
    given technique normalized by the effort in hours required, called the
    fault detection rate.

b)  Detection time - the total number of hours that a subject spent in
    testing a program using a technique.

c)  Cpu-time (*) - the cpu-time in seconds used during the testing session.

d)  Normalized cpu-time (*) - the cpu-time in seconds used during the testing
    session, normalized by a factor for machine speed.

e)  Connect time (*) - the number of minutes that an individual spent on-line
    while testing a program.

f)  # Program runs (*) - the number of executions of the program test driver;
    note that the driver supported multiple sets of input data.

All of the on-line statistics were monitored by the operating systems of the
machines.


4. 1.3. 2.1.  Data  Distributions 


The actual distribution of the fault detection rates for the subjects appears
in Figure 16, broken down by phase. Once again, note the many-to-one
differential in subject performance. Figure 17 displays the mean fault detection
rate for the subjects, broken down by technique, program, expertise level, and
phase.


5 In the phase three data, testing was done on both a VAX 11/780 and an
IBM 4341. As suggested by benchmark comparisons [Church 84], the VAX
cpu-times were divided by 1.6 and the IBM cpu-times were divided by 0.9.


Figure 15. Overall summary of fault detection cost data. Note: some data
pertain to only on-line techniques (*), and some data were collected only in
certain phases. "[illegible]" marks an entry lost in reproduction.

Phase  Subjects  Measure                  Mean    SD           Min.    Max.
1      29        # Faults / hour          1.63    1.28         0.00    7.00
1      29        Detection time (hrs)     3.33    2.09         0.75    10.00
2      13        # Faults / hour          0.99    0.81         0.00    3.00
2      13        Detection time (hrs)     4.70    3.02         1.00    14.00
3      32        # Faults / hour          2.33    2.28         0.00    14.00
3      32        Detection time (hrs)     2.75    [illegible]  0.50    7.25
3      32(*)     Cpu-time (sec)           45.2    56.1         3.0     283.0
3      32(*)     Cpu-time (sec; norm.)    38.5    51.7         2.9     314.4
3      32(*)     Connect time (min)       65.83   50.21        3.50    214.00
3      32(*)     # Program runs           5.45    5.00         1.00    24.00
All    74        # Faults / hour          1.82    1.80         0.00    14.00
All    74        Detection time (hrs)     3.32    2.19         0.50    14.00

Figure 16. Distribution of the fault detection rate (# faults detected per hour) broken down
by phase. Key: code readers (C), functional testers (F), and structural testers (S).

[Histogram not reproduced. Panels: Phase 1, 87 observations; Phase 2, 39
observations; Phase 3, 96 observations. Horizontal axis: fault detection rate,
0 to 15 faults per hour.]


in the phase two and three data; the overall F-test for the phase one data was
rejected (α<.013). In the phase one data, structural testers spent an estimated
1.08 hours less testing than did the other techniques (α<.004, c.i. 0.39 - 1.78),
while code readers were not statistically different from functional testers. Recall
that in phase one, the structural testers observed both a lower number and
percentage of the programs' faults than did the other techniques.

4.1.3.2.3. Dependence on Software Type

Another question in this area focuses on how fault detection rate depends
on software type. The overall F-test that the detection rate is the same for the
programs is rejected in the phase one and phase three data (α<.01 and
α<.0001 respectively); the detection rate among the programs was not
statistically different in phase two. Applying Tukey's multiple comparisons on the
phase one data finds that the fault detection rate was greater on the abstract
data type than on the plotter, while there was no difference either between the
abstract data type and the text formatter or between the text formatter and the
plotter (simultaneous α<.05). In the phase three data, the fault detection rate
was higher in the abstract data type than it was for the text formatter and the
database maintainer, with the text formatter and the database maintainer not
being statistically different (simultaneous α<.05). The overall effort spent in
fault detection was different among the programs in phases one and three
(α<.012 and α<.0001 respectively), while there was no difference in phase two.
In phase one, more effort was spent testing the plotter than the abstract data
type, while there was no statistical difference either between the plotter and the
text formatter or between the text formatter and the abstract data type
(simultaneous α<.05). In phase three, more time was spent testing the database
maintainer than was spent on either the text formatter or on the abstract data
type, with the text formatter not differing from the abstract data type
(simultaneous α<.05). Summarizing the dependence of fault detection cost on
software type: 1) the abstract data type had a higher detection rate and less total
detection effort than did either the plotter or the database maintainer, and the
latter two were not different in either detection rate or total detection time; 2) the
text formatter and the plotter did not differ in fault detection rate or total
detection effort; 3) the text formatter and the database maintainer did not differ
in fault detection rate overall and did not differ in total detection effort in phase
two, but the database maintainer had a higher total detection effort in phase
three; 4) the text formatter and the abstract data type did not differ in total
detection effort overall and did not differ in fault detection rate in phase one,
but the abstract data type had a higher detection rate in phase three.
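
The overall F-tests cited in this section compare mean values across the
programs. As a minimal sketch only, assuming a plain one-way layout (the study
used a fractional factorial design and Tukey's multiple comparisons, neither of
which is reproduced here), the F statistic can be computed as follows. The
function name and the sample detection rates are hypothetical and are not study
data.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way analysis of variance.

    `groups` is a list of lists, one list of observations per program.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical faults-per-hour values for three programs.
    rates = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
    print(one_way_f(rates))
```

The resulting F value would then be compared against the F distribution with
the two degrees of freedom returned; that lookup is omitted here.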

4.1.3.2.4. Computer Costs

In addition to the effort spent by individuals in software testing, on-line
methods incur machine costs. The machine cost measures of cpu-time, connect
time, and the number of runs were compared across the on-line techniques of
functional and structural testing in phase three of the study. A
nonexecution-based technique such as code reading, of course, incurs no machine
time costs. When the machine speeds are normalized (see measure definitions
above), the technique of functional testing used 28.0 more seconds of cpu-time
than did the technique of structural testing (α<.016, c.i. 7.0 - 45.0). The
estimate of the difference is 29.8 seconds when the cpu-times are not normalized
(α<.012, c.i. 9.0 - 50.2). Individuals using functional testing used 28.4 more
minutes of connect time than did those using structural testing (α<.004, c.i.
11.7 - 45.1). The number of computer runs of a program's test driver was not
different between the two techniques (α>.05). These results suggest that
individuals using functional testing spent more time on-line and used more
cpu-time per computer run than did those structurally testing.

4.1.3.2.5. Dependence on Programmer Expertise

The relation of programmer expertise to cost of fault detection is another
question in this goal section. The expertise level of the subjects had no relation
to the fault detection rate in phases two and three (α>.05 for both F-tests).
Recall that phase three of the study used 32 professional subjects with all three
levels of computer science expertise. In phase one, however, the intermediate
subjects detected faults at a faster rate than did the junior subjects (α<.005).
The total effort spent in fault detection was not different among the expertise
levels in any of the phases (α>.05 for all three F-tests). When all 74 subjects
are considered, years of professional experience correlates positively with fault
detection rate (R = .41, α<.0002) and correlates slightly negatively with total
detection time (R = -.25, α<.03). These last two observations suggest that
persons with more years of professional experience detected the faults faster and
spent less total time doing so. Several other subject background measures
showed no relationship with fault detection rate or total detection time
(α>.05). Background measures were examined across all subjects and within
the groups of NASA/CSC subjects and University of Maryland subjects.

4.1.3.2.6. Dependence on Interactions

There were few significant interactions between the main effects of testing
technique, program, and expertise level. There was an interaction between
testing technique and software type in terms of fault detection rate and total
detection cost for the phase three data (α<.003 and α<.007 respectively). Subjects
using code reading on the abstract data type had an increased fault detection
rate and a decreased total detection time.

4.1.3.2.7. Relationships Between Fault Detection Effectiveness and Cost

There were several correlations between fault detection cost measures and
performance measures. Fault detection rate correlated overall with number of
faults detected (R = .48, α<.0001), percentage of faults found (R = .48,
α<.0001), and total detection time (R = -.53, α<.0001), but not with
normalized cpu-time, raw cpu-time, connect time, or number of computer runs
(α>.05). Total detection time correlated with normalized cpu-time (R = .36,
α<.04) and raw cpu-time (R = .37, α<.04), but not with connect time,
number of runs, number of faults detected, or percentage of faults detected.
The number of faults detected in the programs correlated with the amount of
machine resources used: normalized cpu-time (R = .47, α<.007), raw cpu-time
(R = .52, α<.002), and connect time (R = .49, α<.003), but not with the
number of computer runs (α>.05). The correlations for percentage of faults
detected with machine resources used were similar. Although most of these
correlations are minor, they suggest that 1) the higher the fault detection rate,
the more faults found and the less time spent in fault detection; 2) fault
detection rate had no relationship with use of machine resources; 3) spending more
time in detecting faults had no relationship with the number of faults detected;
and 4) the more cpu-time and connect time used, the more faults found.
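
The R values quoted above are correlation coefficients. Assuming R denotes the
Pearson product-moment coefficient (the text does not say which coefficient
was used), a minimal Python sketch of the calculation is given below. The
function name and the sample values are hypothetical and are not study data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical per-subject values: faults per hour and hours spent.
    rate = [0.5, 1.0, 2.0, 4.0]
    hours = [6.0, 4.0, 3.0, 1.5]
    print(pearson_r(rate, hours))
```

A negative result for such data would correspond to the pattern reported above,
in which a higher fault detection rate accompanies less total detection time.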

4.1.3.2.8. Summary of Fault Detection Cost

Summarizing the major results of the comparison of fault detection cost: 1)
in the phase three data, code reading had a higher fault detection rate than the
other methods, with no difference between functional testing and structural
testing; 2) in the phase one and two data, the three techniques were not different
in fault detection rate; 3) in the phase two and three data, total detection effort
was not different among the techniques, but in phase one less effort was spent
for structural testing than for the other techniques, while code reading and
functional testing were not different; 4) fault detection rate and total effort in
detection depended on the type of software: the abstract data type had the
highest detection rate and lowest total detection effort, the plotter and the
database maintainer had the lowest detection rate and the highest total detection
effort, and the text formatter was somewhere in between depending on the phase;
5) functional testing used more cpu-time and connect time than did structural
testing, but they were not different in the number of runs; 6) in phases two and
three, subjects across expertise levels were not different in fault detection rate
or total detection time, while in phase one intermediate subjects had a higher
detection rate; and 7) there was a moderate correlation between fault detection
rate and years of professional experience across all subjects.

4.1.3.3. Characterization of Faults Detected

The third goal area focuses on determining what classes of faults are
detected by the different techniques. In the earlier section on the faults in the
software, the faults were characterized by two different classification schemes:
omission or commission, and initialization, control, data, computation, interface,
or cosmetic. The faults detected across all three study phases are broken down
by the two fault classification schemes in Figure 18. The entries in the figure
are the average percentage (with standard deviations) of faults in a given class
observed when a particular technique was being used. Note that when a subject
tested a program that had no faults in a given class, he/she was excluded from
the calculation of this average.
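
The averaging rule just described can be sketched in Python as follows. This is
an illustration only: the function names and the sample observations are
hypothetical, and the use of the sample (n - 1) standard deviation is an
assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev


def class_percentages(observations):
    """Percentage of a class's faults that each subject observed.

    `observations` is a list of (faults_present, faults_observed) pairs, one
    pair per subject. Subjects whose program had no faults in the class are
    excluded, as described in the text.
    """
    return [100.0 * observed / present
            for present, observed in observations
            if present > 0]


def summarize(observations):
    """Mean and sample standard deviation of the per-subject percentages."""
    percentages = class_percentages(observations)
    return mean(percentages), stdev(percentages)


if __name__ == "__main__":
    # Hypothetical subjects; the second tested a program with no such faults.
    sample = [(2, 1), (0, 0), (4, 4), (1, 0)]
    print(class_percentages(sample))
    print(summarize(sample))
```

Because each subject contributes a single percentage, a class with only one or
two faults per program yields values such as 0, 50, or 100, which is one reason
the standard deviations in such a table can be large.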


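Figure 18. Characterization of the faults detected. Entries are mean
percentages with standard deviations in parentheses. "[illegible]" marks
entries lost in reproduction. Rows marked (+) are incomplete in this copy; their
legible entries are listed in source order and their column positions are
uncertain.

              Code          Functional    Structural
              Reading       Testing       Testing       Overall
Omission      [illegible]; 52.0 (41.3)                                (+)
Commission    [illegible]; 44.3 (26.6); 50.7 (28.4)                   (+)
Total         54.1 (29.2)   54.8 (24.5)   41.2 (28.1)   50.0 (27.3)

Initial.      64.6 (40.3)   75.0 (36.1)   46.2 (39.8)   61.5 (40.2)
Control       42.8 (36.6); 86.7 (34.9); 52.8 (37.2)                   (+)
Data          20.7 (36.6)   28.3 (44.9)   26.8 (41.9)   25.3 (41.0)
Computat.     70.9 (37.0); [illegible]; 64.6 (40.6)                   (+)
Interface     30.7 (33.5); 34.1 (35.1)                                (+)
Cosmetic      8.3 (28.2); 7.7 (27.2); 10.8 (31.3)                     (+)
Total         54.1 (29.2); 41.2 (26.1)                                (+)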

4.1.3.3.1. Omission vs. Commission Classification

When the faults are partitioned according to the omission/commission
scheme, there is a distinction among the techniques. Both code readers and
functional testers observed more omission faults than did structural testers
(α<.001), with code readers and functional testers not being different (α>.05).
Since a fault of omission occurs as a result of some segment of code being left
out, one would not expect structurally generated test data to find such faults.
In fact, 44% of the subjects applying structural testing found zero faults of
omission when testing a program.

4.1.3.3.2. Six-Part Fault Classification

When the faults are divided according to the second fault classification
scheme, several differences are apparent. Both code reading and functional
testing found more initialization faults than did structural testing (α<.05), with
code reading and functional testing not being different (α>.05). Code reading
detected more interface faults than did either of the other methods (α<.01),
with no difference between functional and structural testing (α>.05). This
suggests that the code reading process of abstracting and composing program
functions across modules is an effective technique for finding interface faults.
Functional testing detected more control faults than did either of the other
methods (α<.01), with code reading and structural testing not being different
(α>.05). Recall that the structural test data generation criterion examined is
based on determining the execution paths in a program and deriving test data
that execute 100% of the program's statements. One would expect that more
control path faults would be found by such a technique. However, structural
testing did not do as well as functional testing in this fault class. The technique
of code reading found more computation faults than did structural testing
(α<.05), with functional testing not being different from either of the other two
methods (α>.05). The three techniques were not statistically different in the
percentage of faults they detected in either the data or cosmetic fault classes
(α>.05 for both).

4.1.3.3.3. Observable Fault Classification

Figure 19 displays the average percentage (with standard deviations) of
faults from each class that were observable from the test data submitted, yet
were not reported by the tester.6 The two on-line techniques of functional and
structural testing were not different in any of the fault classes (α>.05). Note
that there was only one fault in the cosmetic class.

6 The standard deviations presented in the figure are high because of the
several instances in which all observable faults were reported.


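Figure 19. Characterization of the faults observable, but not reported.
Entries are mean percentages with standard deviations in parentheses.
"[illegible]" marks entries lost in reproduction. Rows marked (+) are incomplete
in this copy; their legible entries are listed in source order and their column
positions are uncertain.

              Functional    Structural
              Testing       Testing       Overall
Omission      [illegible]; 18.5 (28.8)                  (+)
Commission    [illegible]   [illegible]   19.0 (18.3)
Total         18.1 (17.8)   [illegible]   19.0 (17.3)

Initial.      [illegible]; 9.8 (25.5)                   (+)
Control       [illegible]; 20.7 (30.8)                  (+)
Data          28.6 (43.5)   7.5 (24.5)    18.3 (30.7)
Computat.     10.0 (31.3); 18.0 (34.5)                  (+)
Interface     10.1 (20.0); 18.2 (20.8)                  (+)
Cosmetic      73.2 (44.9)                               (+)
Total         [illegible]   19.9 (10.8)   19.0 (17.3)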

4.1.3.3.4. Summary of Characterization of Faults Detected

Summarizing the major results of the comparison of classes of faults
detected: 1) code reading and functional testing both detected more omission
faults and initialization faults than did structural testing; 2) code reading
detected more interface faults than did the other methods; 3) functional testing
detected more control faults than did the other methods; 4) code reading detected
more computation faults than did structural testing; and 5) the on-line techniques
of functional and structural testing were not different in any classes of faults
observable but not reported.


4.1.4.  Conclusions 


This study compares the strategies of code reading, functional testing, and
structural testing across three data sets in three different aspects of software
testing: fault detection effectiveness, fault detection cost, and classes of faults
detected. Each of the three testing techniques showed merit in this evaluation.
The investigation was intended to compare the different testing strategies in a
representative testing situation, using programmers with a wide range of
experience, different software types, and common software faults.

The major results of this study are 1) with the professional programmers,
code reading detected more software faults and had a higher fault detection rate
than did functional or structural testing, with functional testing detecting more
faults than did structural testing, and with functional and structural testing not
differing in fault detection rate; 2) in one UoM subject group, code reading and
functional testing were not different in faults found, but were both superior to
structural testing, while in the other UoM subject group there was no difference
among the techniques; 3) with the UoM subjects, the three techniques were not
different in fault detection rate; 4) number of faults observed, fault detection
rate, and total effort in detection depended on the type of software tested; 5)
code reading detected more interface faults than did the other methods; and 6)
functional testing detected more control faults than did the other methods.

In comparing these results to related studies, we find mixed conclusions. A
prototype analysis done at the University of Maryland in the Fall of 1981
[Hwang 81] supported the belief that code reading by stepwise abstraction does
as well as the computer-based methods, with each strategy having its own
advantages. In the Myers experiment [Myers 78], the three techniques compared
(functional testing, 3-person code reviews, control group) were equally effective.
He also calculated that code reviews were less cost-effective than the
computer-based testing approaches. The first observation is supported in one
study phase here, but the other observation is not. A study conducted by Hetzel
[Hetzel 76] compared functional testing, code reading, and "selective" testing (a
composite of functional, structural, and reading techniques). He observed that
functional and "selective" testing were equally effective, with code reading being
inferior. As noted earlier, this is not supported by this analysis. The study
described in this analysis examined the technique of code reading by stepwise
abstraction, while both the Myers and Hetzel studies examined alternate
approaches to off-line (nonexecution-based) review/reading.

A few remarks are appropriate about the comparison of the
cost-effectiveness and phase-availability of these testing techniques. When
examining the effort associated with a technique, both fault detection and fault
isolation costs should be compared. The code readers have both detected and
isolated a fault; they located it in the source code. Thus, the reading process
condenses fault detection and isolation into one activity. Functional and
structural testers have only detected a fault; they need to delve into the source
code and expend additional effort in order to isolate the defect. Also, a
nonexecution-based reading process can be applied to any document produced
during the development process (e.g., high-level design document, low-level design
document, source code document), while functional and structural
execution-based techniques may only be applied to documents that are executable
(e.g., source code), which are usually available later in the development process.

Investigations related to this work include studies of fault classification
[Weiss & Basili 85, Johnson, Draper & Soloway 83, Ostrand & Weyuker 83,
Basili & Perricone 84] and Cleanroom software development [Selby, Basili &
Baker 85]. In the Cleanroom software development approach, techniques such
as code reading are used in the development of software completely off-line (i.e.,
without program execution). In the above study, systems developed using
Cleanroom met system requirements more completely and had a higher
percentage of successful operational test cases than did systems developed with a
more traditional approach.

This empirical study is intended to advance the understanding of how
various software testing strategies contribute to the software development process
and to one another. The results given were calculated from a set of individuals
applying the three techniques to unit-sized programs - the direct extrapolation
of the findings to other testing environments is not implied. However, valuable
insights into software testing have been gained.


4.2.  Cleanroom  Development  Approach  Analysis 

The need for discipline in the software development process and for high
quality software motivates the Cleanroom software development approach. In
addition to improving the control during development, this approach is intended
to deliver a product that meets several quality aspects: a system that conforms
with the requirements, a system with high operational reliability, and source
code that is easily readable and modifiable.

The next section describes the Cleanroom approach and a framework of
goals for characterizing its effect. The following section presents an empirical
study using the approach. The results are then given of an analysis comparing
projects developed using Cleanroom with those of a control group. The overall
conclusions are presented in a final section.

4.2.1.  Cleanroom  Software  Development  Method 

The Federal Systems Division of IBM [Dyer 82c, Dyer & Mills 82] presents
the Cleanroom software development method as a technical and organizational
approach to developing software with certifiable reliability. The idea is to deny
the entry of defects during the development of software, hence the term
"Cleanroom." The focus of the method is imposing discipline on the development
process by integrating formal methods for specification and design, complete
off-line development, and statistically based testing. These components are
intended to contribute to a software product that has a high probability of zero
defects and consequently a high measure of operational reliability.




The mathematically-based design methodology of Cleanroom includes the
use of structured specifications and state machine models [Ferrentino & Mills
77]. A systems engineer introduces the structured specifications to restate the
system requirements precisely and organize the complex problems into
manageable parts [Parnas 72b]. The specifications determine the "system
architecture" of the interconnections and groupings of capabilities to which state
machine design practices can be applied. System implementation and test data
formulation can then proceed from the structured specifications independently.

The right-the-first-time programming methods used in Cleanroom are the
ideas of functionally based programming in [Mills 72a, Linger, Mills & Witt 79].
The testing process is completely separated from the development process by
not allowing the developers to test and debug their programs. The developers
focus on the techniques of code inspections [Fagan 79], group walkthroughs
[Myers 78], and formal verification [Hoare 69, Linger, Mills & Witt 79, Shankar
82, Dyer 83] to assert the correctness of their implementation. These
constructive techniques apply throughout all phases of development, and condense
the activities of defect detection and isolation into one operation. This discipline
is imposed with the intention that correctness is "designed" into the software, not
"tested" in. The notion that "Well, the software should always be tested to
find the faults" is eliminated.

In the statistically based testing strategy of Cleanroom, independent testers
simulate the operational environment of the system with random testing. This
testing process includes defining the frequency distribution of inputs to the
system, the frequency distribution of different system states, and the expanding
hierarchy of developed system capabilities. Test cases then are chosen randomly
and presented to the series of product releases, while concentrating on
functions most recently delivered and maintaining the overall composite
distribution of inputs. The independent testers then record observed failures and
determine an objective measure of product reliability. It is believed that the prior
knowledge that a system will be evaluated by random testing will affect system
reliability by imposing a new discipline on the system developers.
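
The random selection of test cases from a frequency distribution of inputs can
be sketched as below. This is a minimal illustration under stated assumptions,
not the procedure used by the independent testers: the function names, the
operation names, and the frequencies are hypothetical, and the fraction of test
cases passed is used only as a simple stand-in for the reliability measure,
which the text does not define.

```python
import random


def draw_test_cases(profile, count, seed=None):
    """Choose `count` operations at random, weighted by expected frequency.

    `profile` maps an operation name to its assumed relative frequency in
    operational use.
    """
    rng = random.Random(seed)
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=count)


def fraction_passed(outcomes):
    """Fraction of executed test cases with no observed failure."""
    if not outcomes:
        raise ValueError("no test outcomes recorded")
    return sum(1 for passed in outcomes if passed) / len(outcomes)


if __name__ == "__main__":
    # Hypothetical frequencies for a message system's functions.
    profile = {"read": 0.5, "send": 0.3, "mailing list": 0.2}
    print(draw_test_cases(profile, 10, seed=1))
    print(fraction_passed([True, True, False, True]))
```

Concentrating on recently delivered functions, as the text describes, could be
approximated by raising their weights for a given release while keeping the
composite distribution over all releases close to the operational one.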


4.2.1.1. Investigation Goals


Some intriguing aspects of the Cleanroom approach include 1) development
without testing and debugging of programs, 2) independent program testing for
quality assurance (rather than to find faults or to prove "correctness" [Howden
76]), and 3) certification of system reliability before product delivery. In order
to understand the effects of using Cleanroom, the following three goals are
proposed: 1) characterize the effect of Cleanroom on the delivered product, 2)
characterize the effect of Cleanroom on the software development process, and
3) characterize the effect of Cleanroom on the developers. An application of the
goal/question/metric paradigm [Basili & Selby 84, Basili & Weiss 84] leads to
the framework of goals and questions for this study appearing in Figure 20.
The empirical study executed to pursue these goals is described in the following
section.




4.2.2.1. Case Study Description


Subjects for the empirical study came from the "Software Design and
Development" course taught by F. T. Baker and V. R. Basili at the University of
Maryland in the Falls of 1982 and 1983. The initial segment of the course was
devoted to the presentation of several software development methodologies,
including top-down design, modular specification and design, PDL, chief
programmer teams, program correctness, code reading, walkthroughs, and
functional and structural testing strategies. For the latter part of the course, the
individuals were divided into three-person chief programmer teams for a group
project [Baker 72b, Mills 72b, Baker 81]. We attempted to divide the teams equally
according to professional experience, academic performance, and implementation
language experience. The subjects had an average of 1.8 years professional
experience and were computer science majors with junior, senior, or graduate
standing. Figure 21 displays the distribution of the subjects' professional
experience.

Figure 21. Subjects' professional experience in years.

[Histogram not reproduced.]


A requirements document for an electronic message system (read, send,
mailing lists, authorized capabilities, etc.) was distributed to each of the teams.
The project was to be completed in six weeks and was expected to be about
1200 lines of Simpl-T source [Basili & Turner 78].7 The development machine
was a Univac 1100/82 running EXEC VIII, with 1200 baud interactive and
remote access available.

The ten teams in the Fall 1982 course applied the Cleanroom software
development approach, while the five teams in the Fall 1983 course served as a
control group (non-Cleanroom). All other aspects of the developments were the
same. The two groups of teams were not statistically different in terms of
professional experience, academic performance, or implementation language
experience. If there were any bias between the two times the course was taught, it
would be in favor of the 1983 (non-Cleanroom) group because the modular
design portion of the course was presented earlier. It was also the second time
F. T. Baker had taught the course. Note that the teams in the non-Cleanroom
group applied a development approach similar to the "disciplined team"
approach examined in an earlier study [Basili & Reiter 81].

The first document every team in either group turned in contained a system
specification, composite design diagram, and implementation plan. The latter
element was a series of milestones describing when the various functions
within the system would be available. At these various dates (minimum one
week apart, maximum two), teams from both groups would then submit their
systems for testing. An independent party would then apply statistically based
testing to each of these deliveries and report to the team members both the
successful and unsuccessful test cases. The latter would be included in the
next test session for verification. Recall that the Cleanroom teams could not
execute their programs - they had editing and syntax-checking capabilities
only. They had to rely on the techniques of code reading, structured
walkthroughs, and inspections to prepare their programs before submission. On
the other hand, the non-Cleanroom teams had full access to compilation and
execution facilities to test their systems prior to independent testing.

7 Simpl-T is a structured language that supports several string and file
handling primitives, in addition to the usual control flow constructs
available, for example, in Pascal. If Pascal or FORTRAN had been chosen, it
would have been very likely that some individuals would have had extensive
experience with the language, and this would have biased the comparison. Also,
restricting access to a compiler that produced executable code would have been
very difficult.

All team projects were evaluated on the use of the development techniques
presented in class, the independent testing results, and a final oral
interview. In addition to these sources, information on the team projects was
collected from a background questionnaire, a postdevelopment attitude survey,
static source code analysis, and operating system statistics. The following
section briefly describes the operationally based testing process applied to
all projects by the independent tester.

4.2.2.2. Operational Testing of Projects

The testing approach used in Cleanroom is to simulate the developing
system's environment by randomly selecting test data from an "operational
profile," a frequency distribution of inputs to the system [Thayer, Lipow &
Nelson 78, Duran & Ntafos 81]. The projects from both groups were tested
interactively at the milestones chosen by each team by an independent party
(i.e., R. W. Selby). A distribution of inputs to the system was obtained by
identifying the logical functions in the system and assigning each a
frequency. This frequency assignment was accomplished by polling eleven
well-seasoned users of the University of Maryland Vax 11/780 mailing system.
Then test data were generated randomly from this profile and presented to the
system. Recording of failure severity and times between failure took place
during the testing process. The operational statistics referred to later were
calculated from fifty user-session test cases run on the final system release
of each team. For a complete explanation of the operationally based testing
process applied to the projects, including test data selection, testing
procedure, and failure observation, see Appendix C.
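
As a minimal sketch of the selection step described above, the Python
fragment below draws random user sessions from an operational profile. The
function names, the frequencies, and the session length are hypothetical
placeholders chosen for illustration; the actual profile came from the user
poll, and the actual procedure is the one given in Appendix C.

```python
import random

# Hypothetical operational profile: logical function -> relative frequency.
# These names and weights are illustrative assumptions, not the polled values.
OPERATIONAL_PROFILE = {
    "read_mail": 0.40,
    "send_mail": 0.30,
    "respond": 0.15,
    "add_to_mailing_list": 0.10,
    "remove_from_mailing_list": 0.05,
}


def generate_session(profile, length, rng):
    """Draw one user session: a sequence of functions sampled by frequency."""
    functions = list(profile)
    weights = [profile[name] for name in functions]
    return rng.choices(functions, weights=weights, k=length)


def generate_test_cases(profile, n_sessions, session_length, seed=0):
    """Generate reproducible user-session test cases from the profile."""
    rng = random.Random(seed)
    return [generate_session(profile, session_length, rng)
            for _ in range(n_sessions)]


# Fifty user-session test cases, as in the study; the length 8 is assumed.
test_cases = generate_test_cases(OPERATIONAL_PROFILE, n_sessions=50,
                                 session_length=8)
```

Fixing the seed keeps a test session repeatable, which matters when failed
cases are to be rerun at the next delivery.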

4.2.3.  Data  Analysis  and  Interpretation 

The analysis and interpretation of the data collected from the study appear
in the following sections, organized by the goal areas outlined earlier. In
order to address the various questions posed under each of the goals, some raw
data usually will be presented and then interpreted. Figure 22 presents the
number of source lines, executable statements, and procedures and functions to
give a rough view of the systems developed.


Figure 22. System statistics.

Team   Cleanroom   Source   Executable   Procedures &
                   Lines    Statements   Functions
A      yes         1681     813          55
B      yes         1626     717          42
C      yes         1118     573          42
D      yes         1046     477          30
E      yes         1087     624          32
F      yes         1213     440          35
G      yes         1196     581          31
H      yes         1876     550          51
I      yes         1305     608          23
J      yes         1052     658          24
a      no           824     410          26
b      no          1429     633          18
c      no          2284     999          46
d      no          1629     628          67
e      no          1310     459          43

4.2.3.1. Characterization of the Effect on the Product Developed

This section characterizes the differences between the products delivered by
both of the development groups. Initially we examine some operational
properties of the products, followed by a comparison of some of their static
properties.

4.2.3.1.1. Operational System Properties

In order to contrast the operational properties of the systems delivered by
the two groups, both completeness of implementation and operational testing
results were examined. A measure of implementation completeness was calculated
by partitioning the required system into sixteen logical functions (e.g.,
send mail to an individual, read a piece of mail, respond, add yourself to a
mailing list, ...). Each function in an implementation was then assigned a
value of two if it completely met its requirements, a value of one if it
partially met them, or zero if it was inoperable. The total for each system
was calculated; a maximum score of 32 was possible. Figure 23 displays this
subjective measure of requirement conformance for the systems. Note that in
all figures presented, the ten teams using Cleanroom are in upper case and the
five teams using a more conventional approach are in lower case. A first
observation is that six of the ten Cleanroom teams built very close to the
entire system. While not all of the Cleanroom teams performed equally well, a
majority of them applied the approach effectively enough to develop nearly the
whole product. More importantly, the Cleanroom teams met the requirements of
the system more completely than did the non-Cleanroom teams.
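
The completeness measure just described can be sketched in a few lines of
Python. The names below are hypothetical, and the example ratings are
invented for illustration; only the 0/1/2 scale, the sixteen functions, and
the maximum of 32 come from the text.

```python
# Rating scale from the text: 2 = fully meets requirements,
# 1 = partially meets them, 0 = inoperable.
FULL, PARTIAL, INOPERABLE = 2, 1, 0
NUM_FUNCTIONS = 16  # the required system was partitioned into 16 functions


def completeness_score(ratings):
    """Sum the per-function ratings; the maximum possible score is 32."""
    if len(ratings) != NUM_FUNCTIONS:
        raise ValueError("expected one rating per logical function")
    if any(r not in (FULL, PARTIAL, INOPERABLE) for r in ratings):
        raise ValueError("ratings must be 0, 1, or 2")
    return sum(ratings)


def completeness_percent(ratings):
    """Express the score as a percentage of the maximum score."""
    return 100.0 * completeness_score(ratings) / (FULL * NUM_FUNCTIONS)


# Invented example: 12 complete, 3 partial, and 1 inoperable function.
example = [FULL] * 12 + [PARTIAL] * 3 + [INOPERABLE]
score = completeness_score(example)      # 12*2 + 3*1 + 0 = 27
percent = completeness_percent(example)  # 27 / 32 = 84.375 percent
```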


To compare testing results among the systems developed in the two groups,
fifty random user-session test cases were executed on the final release of
each system to simulate its operational environment. If the final release of a
system performed to expectations on a test case, the outcome was called a
"success;" if not, the outcome was a "failure." If the outcome was a "failure"
but the same failure was observed on an earlier test case run on the final
release, the outcome was termed a "duplicate failure." Figure 24 shows the
percentage of successful test cases when duplicate failures are not included.
The figure shows that Cleanroom projects had a higher percentage of successful
test cases at system delivery. 9 When duplicate failures are included,
however, the better performance of the Cleanroom systems is not nearly as
significant (MW = .134). 10 This is caused by the Cleanroom projects having a
relatively higher proportion of duplicate failures, even though they did
better overall. This demonstrates that while reviewing the code, the Cleanroom
developers focused less than the other groups on certain parts of the system.
The more uniform review of the whole system makes the performance of the
system less sensitive to its operational profile. Note that operational
environments of systems are usually difficult to define a priori and are
subject to change.

8 The significance levels for the Mann-Whitney statistics reported are the
probability of Type I error in a one-tailed test.


9 Although not considered here, various software reliability models have
been proposed to forecast system reliability based on failure data [Musa 75,
Currit 83, Goel 83].

10 To be more succinct, MW will sometimes be used to abbreviate the
significance level of the Mann-Whitney statistic.
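
The distinction between success, failure, and duplicate failure can be made
concrete with a short Python sketch. It assumes that "not included" means
duplicate failures are dropped from the denominator; the text does not spell
out the exact formula, so this reading, the function name, and the example
outcomes are all assumptions.

```python
def success_percentages(outcomes):
    """Return (percent without duplicates, percent with duplicates).

    `outcomes` lists one entry per test case in execution order:
    None for a success, otherwise an identifier of the observed failure.
    A failure already seen on an earlier case counts as a duplicate.
    """
    if not outcomes:
        raise ValueError("at least one test case is required")
    seen_failures = set()
    successes = 0
    duplicates = 0
    for failure_id in outcomes:
        if failure_id is None:
            successes += 1
        elif failure_id in seen_failures:
            duplicates += 1
        else:
            seen_failures.add(failure_id)
    total = len(outcomes)
    with_duplicates = 100.0 * successes / total
    # Assumed reading: duplicate failures are removed from the case count.
    without_duplicates = 100.0 * successes / (total - duplicates)
    return without_duplicates, with_duplicates


# Invented example: 50 cases, 40 successes, 4 distinct failures,
# and 6 repeats of the first failure.
outcomes = [None] * 40 + ["f1", "f2", "f3", "f4"] + ["f1"] * 6
without_dup, with_dup = success_percentages(outcomes)
```

Under this reading a system with many repeats of one failure scores much
better without duplicates than with them, which is the pattern the text
reports for the Cleanroom projects.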


These comparisons suggest that the non-Cleanroom developers focused on a
"perspective of the tester," sometimes leaving out classes of functions and
causing a less completely implemented product and more (especially unique)
failures. Off-line review techniques, however, are more general and their use
contributed to more complete requirement conformance and fewer failures in the
Cleanroom products. In addition to examining the operational properties of the
product, various static properties were compared.

4.2.3.1.2. Static System Properties

The first question in this goal area concerns the size of the final systems.
Figure 22 showed the number of source lines, executable statements, and
procedures and functions for the various systems. The projects from the two
groups were not statistically different (MW > .10) in any of these three size
attributes. Another question in this goal area concerns the readability of the
delivered source code. Two aspects of reading and modifying code are the
number of comments present and the density of the "complexity." In an attempt
to capture the complexity density, syntactic complexity [Basili & Hutchens 83]
was calculated and normalized by the number of executable statements. In
addition to control complexity, the syntactic complexity metric considers
nesting depth and prime program decomposition [Linger, Mills & Witt 79]. The
developers using Cleanroom wrote code that was more highly commented (MW =
.089) and had a lower complexity density (MW = .079) than did those using the
traditional approach. A calculation of either software science effort
[Halstead 77], cyclomatic complexity [McCabe 76], or syntactic complexity
without any size normalization, however, produced no significant differences
(MW > .10). This seems as expected because all the systems were built to meet
the same requirements.
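
The size comparison can be checked against the Figure 22 data with a short
Python sketch of the Mann-Whitney test. The text does not say how the
significance levels were computed; the sketch assumes a one-tailed test using
the normal approximation with no continuity or tie correction, and the
function names are hypothetical. Under that assumption the three size
attributes come out close to the levels listed for them in the summary figure
(.196, .500, and .357).

```python
import math

# Figure 22 data: (source lines, executable statements, procedures & functions)
CLEANROOM = [(1681, 813, 55), (1626, 717, 42), (1118, 573, 42),
             (1046, 477, 30), (1087, 624, 32), (1213, 440, 35),
             (1196, 581, 31), (1876, 550, 51), (1305, 608, 23),
             (1052, 658, 24)]
NON_CLEANROOM = [(824, 410, 26), (1429, 633, 18), (2284, 999, 46),
                 (1629, 628, 67), (1310, 459, 43)]


def mann_whitney_u(xs, ys):
    """Count pairs (x, y) with x > y; ties contribute one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def one_tailed_significance(xs, ys):
    """One-tailed level from the normal approximation (assumed method)."""
    n1, n2 = len(xs), len(ys)
    u = mann_whitney_u(xs, ys)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    # Upper-tail probability of |z| under the standard normal distribution.
    return 0.5 * math.erfc(abs(z) / math.sqrt(2.0))


def column(rows, index):
    return [row[index] for row in rows]


levels = [one_tailed_significance(column(CLEANROOM, i),
                                  column(NON_CLEANROOM, i))
          for i in range(3)]
```

All three levels exceed .10, in line with the statement that the groups did
not differ in size.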

Comparing the data usage in the systems, Cleanroom developers used a greater
number of global data items (MW = .071). Also, Cleanroom projects possessed a
higher percentage of assignment statements (MW = .056). These last two
observations could be a manifestation of teaching the Cleanroom subjects
modular design later in the course (see Case Study Description), or possibly
an indication of using the approach.

Some interesting observations surface when the operational quality measures
of the Cleanroom products are correlated with the usage of the implementation
language. Both percentage of successful test cases (without duplicate
failures) and implementation completeness correlated with percentage of
procedure calls (Spearman R = .85, signif. = .044, and R = .57, signif. = .08,
respectively) and with percentage of if statements (R = .62, signif. = .058,
and R = .55, signif. = .10, respectively). However, both of these two product
quality measures correlated negatively with percentage of case statements (R =
-.86, signif. = .001, and R = -.69, signif. = .027, respectively) and with
percentage of while statements (R = -.65, signif. = .044, and R = -.49,
signif. = .15, respectively). There were also some negative correlations
between the product quality measures and the average software science effort
per subroutine (R = -.52, signif. = .12, and R = -.74, signif. = .013,
respectively) and the average number of occurrences of a variable (R = -.54,
signif. = .11, and R = -.56, signif. = .09, respectively). Considering the
products from all teams, both percentage of successful test cases (without
duplicate failures) and implementation completeness had some correlation with
percentage of if statements (R = .48, signif. = .07, and R = .45, signif. =
.09, respectively) and some negative correlation with percentage of case
statements (R = -.48, signif. = .07, and R = -.42, signif. = .12,
respectively). Neither of the operational product quality measures correlated
with percentage of assignment statements when either all products or just
Cleanroom products were considered. These observations suggest that the more
successful Cleanroom developers simplified their use of the implementation
language; i.e., they used more procedure calls and if statements, used fewer
case and while statements, had a lower frequency of variable reuse, and wrote
subroutines requiring less software science effort to comprehend.
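
For readers unfamiliar with the Spearman statistic used above, the following
Python sketch computes it as the Pearson correlation of the ranks, with tied
values sharing their average rank. The function names and the sample data are
hypothetical; the per-team language-usage percentages are not tabulated in
this section, so the example does not reproduce any reported coefficient.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; both inputs must have nonzero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_r(xs, ys):
    """Spearman rank correlation of two equal-length samples."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented data: percentage of case statements vs. percentage of successful
# tests for five imaginary teams, arranged to be perfectly inversely ordered.
pct_case = [1.0, 2.5, 0.5, 4.0, 3.0]
pct_success = [90.0, 70.0, 95.0, 40.0, 60.0]
r = spearman_r(pct_case, pct_success)
```

Because only the ordering matters, the statistic is insensitive to the scale
of either measure, which suits small samples of heterogeneous metrics.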

4.2.3.1.3. Contribution of Programmer Background

When examining the contribution of the Cleanroom programmers' background to
the quality of their final products, general programming language experience
correlated with percentage of successful operational tests (without duplicate
failures: Spearman R = .66, signif. = .04; with duplicates: R = .70, signif. =
.03) and with implementation completeness (R = .55; signif. = .10). No
relationship appears between either operational testing results or
implementation completeness and either professional 12 or testing experience.
These background/quality relations seem consistent with other studies [Curtis
83].

12 In fact, there are very slight negative correlations between years of
professional experience and both percentage of successful tests (without
duplicate failures: R = -.46, signif. = .18) and implementation completeness
(R = -.47, signif. = .17).

4.2.3.1.4. Summary of the Effect on the Product Developed

In summary, Cleanroom developers delivered a product that 1) met system
requirements more completely, 2) had a higher percentage of successful test
cases, 3) had more comments and less dense complexity, and 4) used more global
data items and a higher percentage of assignment statements. The more
successful Cleanroom developers 1) used more procedure calls and if
statements, 2) used fewer case and while statements, 3) reused variables less
frequently, 4) developed subroutines requiring less (software science) effort
to comprehend, and 5) had more general programming language experience.

4.2.3.2. Characterization of the Effect on the Development Process

In a postdevelopment attitude survey, the developers were asked how
effectively they felt they applied off-line review techniques in testing their
projects (see Figure 25). This was an attempt to capture some of the
information necessary to answer the first question under this goal (question
II.A). In order to make comparisons at the team level, the responses from the
members of a team are composed into an average for the team. The responses to
the question appear on a team basis in a histogram in the second part of the
figure. Of the Cleanroom developers, teams 'A,' 'D,' 'E,' 'F,' and 'I' were
the least confident in their use of the off-line review techniques and these
teams also performed the worst in terms of operational testing results; four
of these five teams performed the worst in terms of implementation
completeness. Off-line review effectiveness correlated with percentage of
successful operational tests (without duplicate failures) for the Cleanroom
teams (Spearman R = .74; signif. = .014) and for all the teams (R = .78;
signif. = .001); it correlated with implementation completeness for all the
teams (R = .58; signif. = .023). Neither professional nor testing experience
correlated with off-line review effectiveness when either all teams or just
Cleanroom teams were considered.


The histogram in Figure 25 shows that the Cleanroom developers felt they
applied the off-line review techniques more effectively than did the
non-Cleanroom teams. The non-Cleanroom developers were asked to give a
relative breakdown of the amount of time spent applying testing and
verification techniques. Their aggregate response was ShFc off-line review,
od^c functional testing, and 9% structural testing. From this breakdown, we
observe that the non-Cleanroom teams primarily relied on functional testing to
prepare their systems for independent testing. Since the Cleanroom teams were
unable to rely on testing methods, they may have (felt they had) applied the
off-line review techniques more effectively.

13 There are half-responses because an individual checked both the second
and third choices. The responses total to 28, not 30, because two separate
teams lost a member late in the project. (See Distinction Among Teams.)

Since the role of the computer is more controlled when using Cleanroom, one
would expect a difference in on-line activity between the two groups. Figure
26 displays the amount of connect time that each of the teams cumulatively
used. A comparison of the cpu-time used by the teams was less statistically
significant (MW = .110). Neither of these measures of on-line activity related
to how effectively a team felt they had used the off-line techniques when
either all teams or just Cleanroom teams were considered. Although
non-Cleanroom team 'd' did a lot of on-line testing and non-Cleanroom team 'e'
did little, both teams performed poorly in the measures of operational product
quality discussed earlier. The operating system of the development machine
captured these system usage statistics. Note that the time the independent
party spent testing is included. 14 These observations show that Cleanroom
developers spent less time on-line and used fewer computer resources. These
results empirically support the reduced role of the computer in Cleanroom
development.

14 When the time the independent tester spent is not included, the
significance levels for the non-parametric statistics do not change.


Figure 26. Connect time in hours during project development. 15 (Plot not
reproduced; horizontal axis from 0.0 to 155.0 hours, with non-Cleanroom team
'd' at the upper end.) Mann-Whitney signif. = .089

15 Non-Cleanroom team 'e' entered a substantial portion of its system on a
remote machine, using the Univac computer mainly for compilation and
execution. (See Distinction Among Teams.)

Schedule slippage continues to be a problem in software development. It
would be interesting to see whether the Cleanroom teams demonstrated any more
discipline by maintaining their original schedules. All of the teams from both
groups planned four releases of their evolving system, except for team 'G'
which planned five. Recall that at each delivery an independent party would
operationally test the functions currently available in the system, according
to the team's implementation plan. In Figure 27, we observe that all the teams
using Cleanroom kept to their original schedules by making all planned
deliveries; only two non-Cleanroom teams made all their scheduled deliveries.


Figure 27. Number of system releases. (Plot not reproduced; horizontal axis
from 0 to 6 releases.) Mann-Whitney signif. = .006


4.2.3.2.1. Summary of the Effect on the Development Process

Summarizing  the  effect  on  the  development  process,  Cleanroom  developers 
1)  felt  they  applied  off-line  review  techniques  more  effectively,  while  non- 
Cleanroom  teams  focused  on  functional  testing;  2)  spent  less  time  on-line  and 
used  fewer  computer  resources;  and  3)  made  all  their  scheduled  deliveries. 

4.2.3.3. Characterization of the Effect on the Developers

The first question posed in this goal area is whether the individuals using
Cleanroom missed the satisfaction of executing their own programs. Figure 28
presents the responses to a question included in the postdevelopment attitude
survey on this issue. As might be expected, almost all the individuals missed
some aspect of program execution. As might not be expected, however, this
missing of program execution had no relation to either the product quality
measures mentioned earlier or the teams' professional or testing experience.
Also, missing program execution did not increase with respect to program size
(see Figure 29).


Figure 28. Breakdown of responses to the attitude survey question, "Did you
miss the satisfaction of executing your own programs?". (Response counts not
legible.)

Figure 29. Missed program execution versus source lines. (Plot not
reproduced; horizontal axis from 921.0 to 2001.0 source lines.)

Spearman correlations: -No (signif. = .002) with source lines; -.70 (signif.
= .03) with number of separately compilable modules; -.57 (signif. = [not
legible]) with number of procedures and functions.


ore people did nor modify their development style when applying the
techniques of Cleanroom. Several persons mentioned, however, that they already
utilized some of the ideas in Cleanroom. Keeping a simple design supports
readability of the product and facilitates the processes of modification and
verification. Although some of the objective product measures presented
earlier showed differences in development style, these subjective ones are
interesting and lend insight into actual programmer behavior.


question,  "How  was  y 
le  to  test  and  debug? 


2 - Yes, my style was substantially revised.
15 - I modified some of my tendencies.
11 - It did not affect my style at all.

Frequently mentioned responses include

- kept design simple, attempted nothing fancy
- kept readability of code in mind
- already was a user of off-line review techniques
- very careful scrutiny of code for potential mistakes
- prepared for a larger range of inputs


One indicator of the impression that something new leaves on people is
whether they would do it again. Figure 31 presents the responses of the
individuals when they were asked whether they would choose to use Cleanroom as
either a software development manager or as a programmer. Even though these
responses were gathered (immediately) after course completion, subjects
desiring to "please the instructor" may have responded favorably to this type
of question regardless of their true feelings. Practically everyone indicated
a willingness to apply the approach again. It is interesting to note that a
greater number of persons in a managerial role would choose to always use it.
Of the persons that ranked the reuse of Cleanroom fairly low in each category,
four of the five were the same people. Of the six people that ranked reuse
low, four were from less successful projects (one from team 'A', one from team
'E' and two from team 'I'), but the other two came from reasonably successful
developments (one from team 'C' and one from team 'J'). The particular
individuals on teams 'E,' and 'J' rated the reuse fairly low in both
categories.


Figure 31. Breakdown of responses to the attitude survey question, "Would
you use Cleanroom again?". (One person did not respond to this question.)

As a software development manager?

8 - Yes, at all times
14 - Yes, but only for certain projects
5 - Not at all

As a programmer?

4 - Yes, for all projects
18 - Yes, but not all the time
5 - Only if I had to
0 - I would leave if I had to


Two separate Cleanroom teams, 'H' and 'I,' each lost a member late in the
project. Thus at project completion, there were eight three-person and two
two-person Cleanroom teams. Recall that team 'H' performed quite well
according to requirement conformance and testing results, while team 'I' did
poorly. Also, the second group of subjects did not divide evenly into
three-person teams. Since one of those individuals had extensive professional
experience, non-Cleanroom team 'e' consisted of that one highly experienced
person. Thus at project completion, there were four three-person and one
one-person non-Cleanroom teams. Although team 'e' wrote over 1300 source
lines, this highly experienced person did not do as well as the other teams in
some respects. This is consistent with another study in which teams applying a
"disciplined methodology" in development outperformed individuals [Basili &
Reiter 81]. Figure 32 contains the significance levels for the above results
when team 'e,' when teams 'H' and 'I,' and when teams 'e,' 'H,' and 'I' are
removed from the analysis. Removing teams 'H' and 'I' has little effect on the
significance levels, while the removal of team 'e' causes a decrease in all of
the significance levels except for executable statements, software science
effort, cyclomatic complexity, syntactic complexity, connect-time, and
cpu-time.


Figure 32. Summary of measure averages and significance levels.

                                               Average            Mann-Whitney
Measure                                 Cleanroom   Non-Cleanroom  All Teams
                                        Teams       Teams
Source lines                            1320.0      1491.2         .196
Executable stmts                        604.1       625.4          .500
#Procedures & functions                             40.0           .357
%Implementation completeness                        60.0           .088
%Successful tests (w/o duplicate
  failures)                                         80.8           .055
%Successful tests (w/ duplicate
  failures)                                         59.2           .134
#Comments                                           122.2          .089
Syntactic complexity/executable stmts   1.5         1.6            .079
Software Science E                      6728.6e3    7355.4e3       .451
Cyclomatic complexity                   106.8       212.2          .250
Syntactic complexity                    917.5       1017.0         .500
Global data items                       37.6        24.2           .071
%Assignment stmts                                   26.6           .056
Off-line effectiveness                              2.5            .065
Connect-time (hr.)                                  71.3           .089
Cpu-time (min.)                                     136.1          .110
#Deliveries                                         2.6            .006

(Blank cells, and the Mann-Whitney columns "Without Team e" and "Without
Teams H, I," are not legible.)


4.2.4.  Conclusions 


This paper describes "Cleanroom" software development - an approach intended
to produce highly reliable software by integrating formal methods for
specification and design, complete off-line development, and statistically
based testing. The goal structure, experimental approach, data analysis, and
conclusions are presented for a replicated-project study examining the
Cleanroom approach. This is the first investigation known to the authors that
applied Cleanroom and characterized its effect relative to a more traditional
development approach.


The data analysis presented and the testimony provided by the developers
suggest that the major results of this study are 1) most developers were able
to apply the techniques of Cleanroom effectively; 2) the Cleanroom teams'
products met system requirements more completely and had a higher percentage
of successful test cases; 3) the source code developed using Cleanroom had
more comments and less dense complexity; 4) the use of Cleanroom successfully
modified aspects of development style; and 5) most Cleanroom developers
indicated they would use the approach again.

It seems that the ideas in Cleanroom help attain the goals of producing high
quality software and increasing the discipline in the software development
process. The complete separation of development from testing appears to cause
a modification in the developers' behavior, resulting in increased process
control and in more effective use of formal methods for software
specification, design, off-line review, and verification. It seems that system
modification and maintenance would be more easily done on a product developed
in the Cleanroom method, because of the product's thoroughly conceived design
and higher readability. Thus, achieving high requirement conformance and high
operational reliability coupled with low maintenance costs would help reduce
overall costs, satisfy the user community, and support a long product
lifetime.

This empirical study is intended to advance the understanding of the
relationship between introducing discipline into the development process (as
in Cleanroom) and several aspects of product quality: conformance with
requirements, high operational reliability, and easily modifiable source code.
The results given were calculated from a set of teams applying Cleanroom
development on a relatively small project - the direct extrapolation of the
findings to other projects and development environments is not implied.
Valuable insights, however, have been gained from the analysis.


4.3.  Characteristic  Metric  Set  Study 

Several metrics have been proposed to predict product cost/quality and to
capture distinct project aspects [McCabe 76, Halstead 77, Chen 78, Gaffney &
Heller 80, Behrens 83]. The effectiveness of the measures in capturing what is
intended, however, has depended on the particular environment examined
[Walston & Felix 77, Curtis, Sheppard & Milliman 79, Feuer & Fowlkes 79,
Basili 80, Bailey & Basili 81, Boehm 81, Brooks 81, Zolnowski & Simmons 81,
Vosburgh et al. 84]. A particular software metric that has been useful to
characterize, evaluate, or predict aspects of software development in one
environment may have limited usefulness elsewhere. The differing cost/quality
goals among environments and the diversity in methodology, software type, etc.
contribute to the inconsistent performance of metrics. Thus, it is
inappropriate to attempt to select a set of software metrics that have
universal effectiveness across all software environments. The selection of a
set of metrics appropriate for a particular environment must consider its
individual features; that is, a metric set must be customized to a specific
environment.

This study develops an approach for customizing a characteristic set of
cost and quality measures to an environment. The approach is then applied in
a software production environment. This section describes the concept of a
characteristic software metric set, the investigation goals, the empirical study,
and the data analysis.




4.3.1.  Characteristic  Software  Metric  Sets 


The successful management of software projects requires a diverse range of
capabilities, including monitoring and controlling the evolving software system
and forecasting the outcome of the development. Techniques that assist in
these management functions may lead to more successful projects, and possibly
higher product requirement conformance and operational reliability. The idea
of a characteristic software metric set supports several aspects of software
management.

A characteristic software metric set is a concise collection of measures that
capture distinct factors in a software development/modification environment. A
characteristic metric set can be thought of as a vector of measures that
represents different areas of importance in an environment. Since both
cost/quality goals and production environments differ, the particular factors
that are captured by the metrics in the set will differ across environments. The
calculation of a characteristic metric set should be based on the particular cost
and quality goals in an environment, and reflect the inherent differences of the
environment from others.

A characteristic metric set may be used to 1) characterize an environment,
2) compare an environment with others, 3) monitor current project status, or 4)
forecast project outcome relative to past projects, when metrics in the set are
available early in development. Once the distinct factors in an environment's
set are determined, the set then characterizes what aspects are important in the
environment. Comparing the characteristic set of factors in one environment
with the sets of other environments provides a format to distinguish and
contrast among them. Within an individual environment, the actual values of the
metrics in the set characterize a particular project or project subsystem. The
change in the metric values during a project can be used to monitor project
status and its change over time. The characteristic set in conjunction with
historical data can be used to forecast the outcome of the current project relative
to past project performance.

4.3.1.1.  Investigation  Goals

The goals for this study are threefold. I) Develop an approach for
customizing a set of measures to particular cost/quality goals in a specific
environment. II) Apply the approach to calculate the characteristic set for the
NASA/SEL environment. III) Examine the usability of the approach as a
management tool for predicting outcome of system parts. An application of the
goal/question/metric paradigm [Basili & Selby 84, Basili & Weiss 84] leads to
the framework of goals and questions for this study appearing in Figure 33.

Figure  33.  Framework  of  goals  and  questions  for  characteristic  set  study. 

I.   Develop an approach for customizing a set of measures to particular
     cost/quality goals in a particular environment.

     A.  Is the approach sensitive to different cost and quality goals?

     B.  Does the approach capture the aspects that distinguish a given
         environment from others?

II.  Calculate the characteristic set for the NASA/SEL environment.

     A.  In the NASA/SEL environment of projects and programmers, which
         distinct factors are important?

         1.  What is the ordering of factors that reflects their importance
             in the environment?

         2.  How many distinct factors are there?

     B.  What metrics are appropriate for the various factors in the set?

III. Examine the usability of the approach as a management tool for predicting
     outcome of system parts.

     A.  In the NASA/SEL environment of projects and programmers, does
         determining a characteristic metric set and using historical data
         enable one to identify which modules will have interesting attributes,
         such as high total development effort?

     B.  What are the best single identifiers of interesting modules when the
         cost/quality aspect considered changes?


4.3.2.  Empirical  Study 

This  section  describes  the  SEL  environment  examined  and  the  scheme  for 
data  collection. 

4.3.2.1.  SEL  Environment

The Software Engineering Laboratory (SEL) [Basili et al. 77, Basili &
Zelkowitz 78, Card et al. 82, SEL 82] is a joint venture between the University of
Maryland, NASA/Goddard Space Flight Center, and Computer Sciences
Corporation. The purpose of the SEL has been to provide an experimental
database for examining relationships among the factors that affect the software
development process and the delivered product. The software comprising the
database is ground support software for satellites. The six systems analyzed in
this study consisted of 51,000 to 112,000 lines of FORTRAN source code, and
took between 6900 and 22,300 man-hours to develop over a period of 9 to 21
months. There are from 200 to 800 modules (e.g., subroutines) in each system
and the staff size ranges from 8 to 23 people per project, including the support
personnel. Anywhere from 10 to 61 percent of the source code is reused or
modified from previous projects.

4.3.2.2.  Effort,  Change,  and  Fault  Data

The data discussed in this study are extracted from several sources.
Among the data analyzed are the effort to design, code, and test the various
modules of the systems as well as the changes and faults that occurred during
their development. Effort data were obtained from a collection form that is
filled out weekly by all programmers on the project. They report the time they
spent on each module in the system, partitioned into the phases of design, code,
and test, as well as any other time they spent on work related to the project,
e.g., documentation, meetings, etc. A module is defined as any named object in
the system; that is, a module is either a main procedure, block data, subroutine,
or function. The faults and changes are reported on another data collection
form that is completed by a programmer each time a change is made to the
system. A static code analysis program called SAP [Decker & Taylor 82]
automatically computed several of the static metrics examined in this analysis.


4.3.3.  Data  Analysis 


The following sections present the analysis and results from this study,
broken down by the goal areas outlined earlier.

4.3.3.1.  Approach  for  Set  Calculation

A proposed approach for calculating a characteristic set consists of three
steps: 1) formulate the goals and questions that represent cost/quality factors in
an environment; 2) list all measures that capture information relating to the
goals; and 3) condense measures into a set capturing distinct factors. This
approach satisfies the two key aspects of customizing a characteristic metric set to
an environment: sensitivity to the cost/quality goals of importance in the
environment, and capturing the features that give the environment its identity.

The first step is to generate a goal and question framework for the
environment on which to base the generation of all potential metrics. After the goals
and questions have been specified for an environment, all possible metrics are
listed that represent relevant information. These first two steps are an
application of the goal/question/metric paradigm [Basili & Weiss 84, Basili & Selby
84]. Since a software environment is in some sense defined by the projects it
develops, applying the metrics listed to those projects reflects an environment's
distinguishing features. The third step is to condense the collection of measures
into a characteristic set. Factor analysis may be applied to accomplish this
step. This data reduction task actually groups the metrics listed according to
how they relate to the distinct factors in an environment. Appropriate metrics
that relate to each of the factors can then be selected based on some criteria,
such as ease of calculation or phase availability.
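The condensing step can be illustrated with a minimal sketch. It is not factor analysis: as a simpler stand-in, it greedily groups measures whose values are strongly correlated across modules, so that one representative per group can then be chosen. The function names, the 0.9 threshold, and the sample values are assumptions made for illustration only.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length value lists (0.0 if constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def group_metrics(data, threshold=0.9):
    """Greedily group measures by correlation with each group's first member.

    data maps a measure name to its values over the same list of modules.
    The threshold is a hypothetical cut-off, not a value from the study.
    """
    groups = []
    for name in data:
        for group in groups:
            if abs(pearson(data[name], data[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Made-up values for five modules.
sample = {
    "source_lines": [10, 20, 30, 40, 50],
    "executable_statements": [8, 17, 24, 33, 41],
    "design_effort": [5, 1, 4, 2, 3],
}
groups = group_metrics(sample)
```

In this made-up sample the two size measures fall into one group and design effort into another; a real application would use a proper factor analysis over all candidate measures.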

4.3.3.1.1.  An  Alternate  Approach

An alternate approach to determining a small set of characteristic measures
was examined in [Elshoff 84]. In this approach, twenty candidate complexity
measures were calculated on 585 PL/I procedures. The name of each procedure
was put into a large "complexity pot" once for each time the procedure
appeared in the top decile of a candidate complexity measure. Since there were
twenty candidate measures, the name of a given procedure could then appear
up to twenty times in the pot. The procedures identified by a single measure
were then compared with those in the total pot. For each appearance of a
procedure name in the total pot, a candidate measure was awarded one point if
that name was in the measure's top decile. The candidate complexity measure
that scored the highest would be selected for the characteristic set. All
occurrences of procedure names that appeared in the top decile of the first
measure selected were then removed from the pot. The scores for the measures
were then recalculated based on the remaining procedures, and another measure
would then be selected, continuing until no procedures remained in the pot.
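The selection procedure described above can be sketched as follows. This is an illustrative reading of the description, not code from [Elshoff 84]; the function names, the alphabetical tie-breaking rule, and the sample data are assumptions.

```python
from collections import Counter


def top_decile(values):
    """Names of the procedures ranked in the top 10% for one measure."""
    k = max(1, len(values) // 10)
    ranked = sorted(values, key=values.get, reverse=True)
    return set(ranked[:k])


def select_measures(measures):
    """measures maps measure name -> {procedure name: value}."""
    deciles = {m: top_decile(v) for m, v in measures.items()}
    # The pot holds one appearance of a name per measure that ranks it
    # in its top decile.
    pot = Counter()
    for names in deciles.values():
        pot.update(names)
    selected = []
    remaining = set(deciles)
    while pot and remaining:
        # One point per pot appearance of a name in the measure's top decile.
        scores = {m: sum(pot[p] for p in deciles[m]) for m in remaining}
        best = max(sorted(scores), key=scores.get)  # ties: alphabetical
        if scores[best] == 0:
            break
        selected.append(best)
        remaining.discard(best)
        # Remove every appearance of the names the chosen measure identifies.
        for p in deciles[best]:
            pop_result = pot.pop(p, None)
    return selected


# Ten made-up procedures; three "quantity" measures all peak on p0.
procs = [f"p{i}" for i in range(10)]


def peak(at):
    return {p: (100 if p == at else 1) for p in procs}


example = {
    "length": peak("p0"),
    "volume": peak("p0"),
    "statements": peak("p0"),
    "cyclomatic": peak("p5"),
}
chosen = select_measures(example)
```

In the made-up example a member of the three mutually dependent measures is selected first simply because its procedure appears three times in the pot, which is the bias discussed in the next paragraph.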

This alternate approach suffers because of the biased technique used to
select measures in the characteristic set, and a troublesome fundamental
assumption in the calculation. Including a large number of highly dependent
program measures in the collection examined (e.g., the software "quantity" group
of executable statements, length, volume, vocabulary, ...) increased
disproportionately the number of appearances of routines commonly selected by that
group in the pot of "complex" programs. It is therefore no surprise that the
measure that selected the greatest percentage of the appearances in the pot is
one member of the "quantity" group (length). In each of the twenty program
measures examined, the top decile of programs was chosen as the most complex
according to that measure. This decision relied on the implicit assumption that
software complexity is a monotonically increasing function of each of the
measures, which is possibly troublesome.

The approach presented here bases the selection of a characteristic set of
measures on aspects of cost and quality in an environment. The use of
measures in the characteristic set to identify modules with particular attributes, such
as those of high "complexity" as was done in [Elshoff 84], is discussed in "Use
as a Management Tool."

4.3.3.2.  Application  in  the  SEL  Environment

In the application of the approach in the NASA/SEL environment, there
were two major reasons to use just six recent projects. First, changes and
improvements in development technologies and personnel tend to be reflected in
the projects developed (as they are intended to be). Therefore, the
consideration of projects not recently completed would not be representative of the
current environment. Second, several development environments do not have a
long history of data collection. Discussing an approach that required a large
project database would have little utility for them.


Three goal areas were defined for the SEL environment. The first goal area
was to analyze the system development effort. An example question under this
goal is "What are the attributes of modules that result in high development
effort?". The second goal area was to analyze the system modifications. An
example question here is "What are the attributes of modules that will be difficult
to change?". Analyzing the system faults was the third goal area. An example
question would be "What are the attributes of modules that will be
fault-prone?". The generated list of measures based on these three goal areas appears
in Figure 34; a total of 65 measures was examined. The measures are grouped
according to the general areas of size/complexity [McCabe 76], effort,
faults/changes, and software science [Halstead 77]. The set notation in the
figure signifies the normalization of one metric by another, e.g., amount of
design effort was considered alone and normalized by the amount of code effort,
testing effort, and overhead effort. In addition to being examined alone, the
effort and faults/changes measures were in general normalized over the
size/complexity measures.
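As a small illustration of the set notation described above, the sketch below expands a "{...} over {...}" entry into its individual ratio measures and evaluates one of them for a module. The helper names and the effort values are hypothetical; only the measure names come from the list of measures.

```python
from itertools import product


def expand_over(numerators, denominators):
    """Expand '{a, b} over {c, d}' into the ratio names a/c, a/d, b/c, b/d."""
    return [f"{n}/{d}" for n, d in product(numerators, denominators)]


def ratio(values, name):
    """Evaluate one ratio measure for a module; None if the denominator is 0."""
    numerator, denominator = name.split("/")
    if not values[denominator]:
        return None
    return values[numerator] / values[denominator]


names = expand_over(
    ["design_effort"],
    ["code_effort", "testing_effort", "overhead_effort"],
)

# Made-up effort values (hours) for one module.
module = {
    "design_effort": 6.0,
    "code_effort": 12.0,
    "testing_effort": 3.0,
    "overhead_effort": 0.0,
}
design_over_code = ratio(module, names[0])
```

Returning None for a zero denominator is one possible convention; a study would need to decide how modules with no recorded effort in a phase are treated.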


Figure 34. List of measures examined in the SEL environment.

Size/Complexity Area

source lines (SRC)
executable statements (XQT)
comments
comments/SRC
XQT/(SRC-comments)
Cyclomatic_complexity
Cyclomatic_complexity_2
calls
{Cyclomatic_complexity, Cyclomatic_complexity_2} over {SRC, XQT}

Effort Area

total_effort
design_effort
code_effort
testing_effort
{design_effort} over {code_effort, testing_effort, overhead_effort}
{code_effort} over {testing_effort, overhead_effort}
{testing_effort} over {overhead_effort}
{design_effort, code_effort, testing_effort} over {total_effort, calls, η2*}
{total_effort} over {SRC, SRC-comments, XQT, calls, η2*}

Faults/Changes Area

version
total_changes
weighted_changes
total_faults
weighted_faults
{total_faults, weighted_faults} over {SRC, XQT}

Software Science

η1  η2  η2*  N1
N2/η2  N  N'  V  V*  L
L*  1/L  1/L*  E  E*  E  E*
B~  lambda  E/SRC

From the six projects, this analysis focuses on 652 newly developed
modules with complete data for the measures listed in Figure 34. The use of
factor analysis isolated a set of six distinct factors: in order of
overall importance, {size, effort, η2*, fault density, code and test effort,
#changes}. The η2* metric is the number of I/O parameters in a module. Some
appropriate measures that related well to each of the factors in the set were a)
size - source lines, executable statements, and η1 (the number of unique
operators); b) effort - design effort, η2*, and testing effort / η2*; c) η2* - η2*; d) fault
density - #faults / executable statement; e) code and test effort - code effort,
code effort / #subroutine calls; and f) #changes - number of module versions.
Thus, a feasible characteristic metric set for the SEL environment is {source
lines, design effort, number of I/O parameters, fault correction effort per
executable statement, code effort, number of versions}.

4.3.3.3.  Use  as  a  Management  Tool

Although a characteristic set has the several uses outlined earlier, this
study focuses on the use of measures in the set to forecast the outcome of
modules in projects. Several studies have pointed to the unsatisfactory use of
metrics as direct predictors of software cost and quality [Hamer & Frewin 82;
Basili, Selby & Phillips 83; Shen, Conte & Dunsmore 83]. This inadequacy
motivates the use of software metrics from a new perspective: the examination
of how well the metrics in the characteristic set can identify system parts (or
whole systems) resulting in high or low cost/quality. System parts with
interesting cost or quality attributes include those with high/low development
effort, high/low modification effort, or high/low fault correction effort.

An approach for using metrics to identify system parts having interesting
attributes is as follows. First, select some interesting cost or quality aspect of a
system part, such as the total development effort for a module. Then, choose a
window of modules that is useful to identify, such as those modules that will be
in a project's upper quartile of development effort. Next, determine the ranges
of metric values that contained modules from past projects ending up in the
upper quartile of development effort. From the calculation of the sensitive
metric ranges and the use of conditional probabilities from historical data, this
approach is intended to be able to identify interesting modules in the system.
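The calculation outlined above can be sketched in a few lines. This is a minimal illustration under stated assumptions: quartiles are assigned by rank within one set of historical modules, and the function names, module names, and values are hypothetical rather than taken from the SEL data.

```python
QUARTILES = ("upper", "second", "third", "lower")


def quartile_labels(values):
    """Label each module with its quartile by rank (largest values = upper)."""
    ranked = sorted(values, key=values.get, reverse=True)
    n = len(ranked)
    return {m: QUARTILES[min(3, (i * 4) // n)] for i, m in enumerate(ranked)}


def conditional_probabilities(metric, outcome, target="upper"):
    """P(outcome in target quartile | metric in quartile q) for each q.

    metric and outcome map module name -> value for past modules.
    """
    metric_q = quartile_labels(metric)
    outcome_q = quartile_labels(outcome)
    table = {}
    for q in QUARTILES:
        members = [m for m in metric_q if metric_q[m] == q]
        hits = sum(1 for m in members if outcome_q[m] == target)
        table[q] = hits / len(members) if members else None
    return table


# Made-up history of eight modules: code effort and total development effort.
code_effort = {"m0": 8, "m1": 7, "m2": 6, "m3": 5,
               "m4": 4, "m5": 3, "m6": 2, "m7": 1}
total_effort = {"m0": 100, "m1": 50, "m2": 90, "m3": 40,
                "m4": 30, "m5": 20, "m6": 10, "m7": 5}
table = conditional_probabilities(code_effort, total_effort)
```

Each entry of the resulting table plays the role of one cell in a row of conditional probabilities: the chance that a module ends up in the target quartile of the dependent variable, given the quartile of the metric it currently falls in.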

4.3.3.3.1.  Conditional  Probabilities  from  Historical  Data

The conditional probabilities displayed in Figure 35 were calculated from
six SEL projects, and are interpreted as follows. The table is divided into three
sections, corresponding to the three SEL goal areas discussed above. There is a
table section for each dependent variable: total module development effort, total
effort for module modification, and total effort for fault correction in a module.
The characteristic set of six metrics that represents the different environmental
factors is listed in each section of the table. Consider the section on total
module development effort. The entries in the table are the probability that a
module's eventual outcome will be in the upper quartile of total module
development effort, given that a module is currently in quartile Q_j of metric M_i.
For example, given that a module is in the upper quartile of code effort, it has a
probability of .74 of ending up in the upper quartile of total module
development effort. A module in the third quartile of source lines has a probability of
just .14 of ending up in the upper quartile of total development effort. The
interpretation is the same for the other dependent variables of module
modification effort and module fault correction effort. Figure 36 is analogous to
Figure 35, except that the entries are the conditional probabilities that the
eventual outcome will be in the lower (instead of the upper) quartile of the
respective dependent variable. For example, a module in the lower quartile of
number of versions has a probability of .50 of ending up in the lower quartile of


Figure 35. Conditional probabilities based on SEL data:
upper quartiles of dependent variables.

Dependent       Characteristic Set         Quartile of Metric M_i
Variable        Metric M_i                 Upper  Second  Third  Lower

Module          code effort                 .74     -       -     .04
Development     design effort               .56     -       -     .13
Effort          source lines                 -      -      .14    .09
                η2*                          -      -       -     .11
                version                      -      -       -     .06
                fault correction
                effort / XQT                 -      -       -     .16

Module          fault correction
Modification    effort / XQT                .85    .18     .08    .09
Effort          version                     .52    .33     .11    .04
                code effort                 .50    .27     .17    .06
                source lines                .50    .28     .13    .09
                η2*                          -     .24     .23    .08
                design effort               .41    .25     .18    .17

Module          fault correction
Fault           effort / XQT                 -      -       -      -
Correction      version                      -      -       -      -
Effort          code effort                  -      -       -      -
                source lines                 -      -       -      -
                η2*                          -      -       -      -
                design effort                -      -       -      -

(-  entry not legible in the source copy)


Figure 36. Conditional probabilities based on SEL data:
lower quartiles of dependent variables.

Dependent       Characteristic Set         Quartile of Metric M_i
Variable        Metric M_i                 Upper  Second  Third  Lower

Module          code effort                  -     .00     .23    .77
Development     source lines                 -     .12     .24    .54
Effort          version                      -     .14     .30    .50
                η2*                          -     .21     .25    .45
                design effort               .02    .23     .37    .38
                fault correction
                effort / XQT                 -     .25     .32    .31

Module          version                      -      -       -      -
Modification    fault correction
Effort          effort / XQT                 -      -       -      -
                η2*                          -      -       -      -
                source lines                 -      -       -      -
                code effort                  -      -       -      -
                design effort                -      -       -      -

Module          fault correction
Fault           effort / XQT                 -      -       -      -
Correction      version                      -      -       -      -
Effort          source lines                 -      -       -      -
                code effort                  -      -       -      -
                η2*                          -      -       -      -
                design effort                -      -       -      -

(-  entry not legible in the source copy; further values printed without a
recoverable row assignment: .15 .28 .13 .43 .19 .25 .18 .30 .18 .34 .28 .27)


4.3.3.3.2.  Data  Interpretation

The information in these tables could be used to forecast the outcome of
modules in a system. At the end of the design phase, the η2* metric and the
amount of effort spent in design are known. The modules in the upper quartile
of design effort should be identified by a project manager because these modules
have a probability of .56 of ending up in the upper quartile of total
development effort. That is, in this environment the modules in the upper quartile of
design effort are more than twice (=.56/.25) as likely as by chance to be the
most expensive to develop overall; these modules are approximately 28
(=.56/.02) times more likely to be in the upper quartile of total development
effort than to be in the lower quartile of total development effort. Modules in
the upper quartile of the η2* metric are almost twice as likely as by chance to
require the most effort to develop, modify, and correct. Other observations
include 1) it is easiest to identify those modules that will have high development
effort; 2) it is most difficult to identify those modules that will require little
fault correction effort; and 3) the metrics of design effort and η2* are reasonably
similar in forecasting ability, except that η2* is superior in identifying modules
that will require little modification effort.
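The ratios quoted above follow from simple arithmetic, sketched here for concreteness. The chance level of .25 reflects that a quartile holds one quarter of the modules; the probabilities .56 and .02 are the ones cited in the text, while the function names are assumptions made for illustration.

```python
CHANCE = 0.25  # a quartile contains one quarter of the modules


def lift_over_chance(probability):
    """How many times more likely than chance an outcome quartile is."""
    return probability / CHANCE


def upper_vs_lower(p_upper, p_lower):
    """Ratio of ending in the upper versus the lower outcome quartile."""
    return p_upper / p_lower


# Upper quartile of design effort, with total development effort as outcome.
design_lift = lift_over_chance(0.56)       # .56 / .25
design_odds = upper_vs_lower(0.56, 0.02)   # .56 / .02
```

The same calculation applied to the .74 entry for code effort gives a ratio just under three, which matches the later remark that extreme quartiles of coding effort are about three times more likely than chance to coincide with the extreme quartiles of total development effort.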

The two tables above help characterize the SEL development environment.
The total development effort for a module tends to be indicated by the module's
coding effort: modules in the extreme quartiles of coding effort are three times
more likely than by chance to be in the corresponding extreme quartiles of total
development effort. Since the programmers in the SEL are quite experienced in
the application and with appropriate design approaches, the dominance of
coding effort seems reasonable. In other environments, the amount of design effort
might better indicate the total development effort required. Other observations
include 1) high density of fault correction effort (fault correction effort per
executable statement) indicates high total modification effort and high total fault
correction effort; and 2) an extreme (high or low) number of program versions
reflects a corresponding amount of modification effort and of correction effort.

Ideally, the metrics in the characteristic set would all be available early in
development and have strong relationships with the dependent variables of
interest. Some measures, such as fault correction effort per executable statement,
have limited usefulness as forecasters because they are not available until late
in project development. An assumption is needed in order to use conditional
probabilities from past projects to forecast the outcome of modules from a
current project. The assumption is that the relationship between a module's
metric value (at a point in time) and its eventual outcome is the same as the
relationship between the metric values from past projects' modules (at a
corresponding point in time) and their eventual outcome. When using data
based on recent projects that were similar to the current one, this assumption is
reasonable. Note that the examples and conditional probabilities presented are
from a particular environment; project data from other environments may
differ.

Using a characteristic metric set with conditional probabilities from past
projects enables the monitoring of a small set of customized measures to
forecast the outcome of the current project. A characteristic set is usable as a
management tool as soon as the metrics in the set are available.


4.3.4.  Conclusions 


A characteristic software metric set is intended to help support the effective
management of software development and modification. The approach
examined for building a characteristic metric set is adaptable to different cost/quality
goals and to different environments. The calculation and use of the set could be
coupled to an automated project monitor and database. The major results of
this study are 1) an approach has been described for customizing a
characteristic software metric set to an environment; 2) the application of the approach to
the SEL production environment yielded the characteristic software metric set
{source lines, design effort, number of I/O parameters, fault correction effort per
executable statement, code effort, number of versions}; and 3) the use of a
characteristic set with conditional probabilities from historical data can assist in
project management by forecasting the outcome of system parts. This work is
intended to advance the understanding of the use of various metrics to
characterize and predict aspects of software cost and quality.




5.  Conclusions 


The understanding of the technologies that contribute to quality in the
software development process and the final product is fundamental to the
advancement of the software field. This dissertation presents three studies that
evaluate factors in key areas of software development, maintenance, and
management: testing strategies, Cleanroom software development, and
environmental metrics.

In each of the studies, a seven-step approach for quantitatively evaluating
software technologies couples software methodology evaluation with software
measurement. In the approach, goal/question frameworks of a technology's
potential effect on software cost and quality are coupled with measurable
attributes and appropriate quantitative analysis methods. The seven-step analysis
methodology provides a paradigm for quantitatively assessing the effect of
software technologies on software development and maintenance.

The goal structure, data analysis, and conclusions were presented for three
studies: a blocked subject-project study comparing software testing strategies, a
replicated project study characterizing the effect of using the Cleanroom
software development approach, and a multi-project variation study to determine a
characteristic set of software cost and quality metrics. The different studies
were chosen to satisfy several criteria: scope of evaluation, representative
domain sampling, quantitative analysis method, area of assessment, scope of
technology, and potential benefit. The three studies are the following. 1)
Software Testing Strategies: A 74-subject study, including 32 professional
programmers and 42 advanced university students, compared code reading, functional
testing, and structural testing in a fractional factorial design. 2) Cleanroom
Software Development: Fifteen three-person teams separately built a 1200-line
message system to compare Cleanroom software development (in which software
is developed completely off-line) with a more traditional approach. 3)
Characteristic Software Metric Sets: In the NASA S.E.L. production environment, a
study of 65 candidate product and process measures of 652 modules from six
(51,000 - 112,000 line) projects yielded a characteristic set of software
cost/quality metrics.

These empirical studies are intended to demonstrate an analysis
methodology in a variety of problem domains and to advance the understanding of 1) the
contribution of various software testing strategies to the software development
process and to one another; 2) the relationship between introducing discipline
into the development process and several aspects of product quality
(requirement conformance, operational reliability, and modifiable source code); and 3)
the use of software metrics to characterize software environments and to predict
project outcome.

5.1.  Overall  Results  from  the  Software  Technology  Evaluations 

The major results from the software technology evaluations are the
following. 1) With the professional programmers, code reading detected more
software faults and had a higher fault detection rate than did functional or
structural testing, while functional testing detected more faults than did structural
testing, but functional and structural testing were not different in fault
detection rate. 2) With the advanced students, the three testing techniques were not
different in the number of faults detected or in the fault detection rate, except
that structural testing detected fewer faults than did the others in one study
phase. 3) Code reading detected more interface faults and functional testing
detected more control faults than did the other methods. 4) Most developers
using the Cleanroom software development approach were able to build systems
completely off-line. 5) The Cleanroom teams' products met system requirements
more completely and succeeded on more operational test cases than did those
developed with a traditional approach. 6) An approach described for
calculating a characteristic metric set yielded the set for the NASA S.E.L. environment
{source lines, design effort, number of input/output parameters, fault correction
effort per executable statement, code effort, number of versions}.

5.2.  Problem  Areas 

The use of the quantitative approach for evaluating software technologies
identified several problem areas in data collection and analysis in software
research and management, suggesting future research areas. 1) The process of
formulating intuitive problems into precisely stated goals is a nontrivial task.
The inherent difficulty in goal writing reflects the uncertainty of all aspects
of quality in the software product and development process. 2) Numerous
software metrics have been proposed to measure distinct attributes of software.
These metrics need to be validated to determine whether they actually capture
what is intended. 3) The process of collecting accurate data is a continuing
challenge. While there is increasing potential in automated collection schemes,
the more common data collection forms are subject to incompleteness,
inconsistency, and human error. 4) With the growing number of controlled
studies done to determine which factors contribute to software quality, the
selection of samples (e.g., programmers, programs, ...) to analyze is
fundamental. In order for the results of these studies to apply to larger
environments, representative samples of sufficient size must be selected.
5) These controlled studies are expensive to conduct. Both industry and
academia must help support these efforts; e.g., academic researchers using
subjects from industry. 6) There seems to be an interdependency among several
factors that contribute to product and process quality. The use of several
techniques together may be effective as a "critical mass", making the isolation
of their individual effects difficult. 7) The methods of analysis must account
for the high variation in individual performance. Without careful planning,
this many-to-one differential among humans can taint experimental results.
8) Researchers have rarely been able to reproduce results across environments.
In addition to the lack of consistent use of measures, every software
development or modification environment seems to


5.3.  Overall  Conclusions 

The quantitative approach for evaluating software technologies has been
applied in three analyses of factors contributing to software quality. The
overall conclusions from this work are the following. 1) The approach described
for quantitatively evaluating software technologies has been demonstrated to be
effective in a variety of problem domains. 2) The results from the testing
strategy study suggest that code reading by stepwise abstraction (a
nonexecution-based method) is at least as effective as on-line functional and
structural testing in terms of number and cost of faults observed. 3) The
results from the Cleanroom study demonstrate the feasibility of complete
off-line development (as in Cleanroom) and suggest that such a development
approach is superior to a more traditional approach. 4) The results from the
software metric study suggest that a characteristic metric set can assist in
aspects of project management, including the forecasting of effort for
development, modification, and fault correction of modules based on historical
data.


6.  Appendices 

6.1.  Appendix  A.  Overview  of  Sampling  and  Statistical  Test  Application

In the range of software analyses in the four-part classification scheme
presented earlier, there is a relationship between the effectiveness of
statistical methods (attainable statistical significance) and the
representativeness of the sampled observations to production-world situations.
Because of this, observers sometimes criticize the conclusions of an analysis
or express doubt as to how well the results would extrapolate to environments
different from the ones studied. This happens even when the analysis presented
was sound and statistically significant. Emphasis needs to be placed on two
aspects of applying statistical tools in an analysis: observation sampling and
statistical test application. When an experiment is run, a certain sampling of
data from some population is analyzed to achieve some result. After applying a
statistical test to attributes of the members of the sample, a set of
conclusions is derived.

The major considerations when choosing a sampling from a population are
how well the sampling represents the whole population and how large the
sampling should be. If a population is finite, the most representative sampling
would be to select the whole population. There could then be no argument that
the observations studied did not represent the whole population. Several
interesting populations, such as programmers or software systems, are infinite,
so a reasonable finite sampling must be chosen. Techniques used to effectively
choose this sample are in statistical sampling theory [Cochran 53]. The
stratification of the population is an important aspect of this process; that
is, the identification of all the relevant aspects that differentiate among
members of the population. This set of aspects is then distilled into a
pseudo-basis 16 set, and then observations are chosen along the range of each
basis set component. If statistical results are generated from a finite
sampling of an infinite population, the issue of controversy is usually how
well this sampling corresponds to the intended population. One component of the
representativeness of this set is its size.
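
The stratification step described above can be illustrated with a short Python
sketch. It is only an illustration under stated assumptions, not the sampling
procedure used in the studies: the population records, the "experience"
attribute used as the stratifying aspect, and the function name
stratified_sample are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` members from every stratum of `population`.

    `stratum_of` maps a member to its stratum label (a hypothetical attribute
    standing in for one component of the pseudo-basis set).
    """
    rng = random.Random(seed)          # fixed seed so the draw is repeatable
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)

    sample = []
    for label in sorted(strata):       # visit the strata in a stable order
        members = strata[label]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 30 programmers tagged by experience level.
LEVELS = ("junior", "intermediate", "advanced")
programmers = [{"id": i, "experience": LEVELS[i % 3]} for i in range(30)]
chosen = stratified_sample(programmers, lambda p: p["experience"], per_stratum=4)
```

Sampling every stratum separately guarantees that each level of the chosen
aspect is observed, which a simple random draw of the same size does not.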

In determining the sample size, both the achievement of statistical
significance in the experimental design and the economic constraints need to be
considered. The cardinality of the basis set determines the number of factors
whose effect must be accounted for in the experimental design. The effect of
these factors is blocked out in the design, enabling the investigation to focus
on distinguishing between the particular treatments being examined [Box,
Hunter, & Hunter 78]. The decision to choose an experimental design is balanced
between one capable of blocking out these factors and the need to keep the
sample size economically feasible. The size of the sample also affects the
probability of erroneous conclusions, referred to as Type I and Type II errors
[Siegel 55, pp. 8-11]. 17
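
The dependence of the error probabilities on sample size can be made concrete
with a small Python sketch. It assumes a one-sided z-test with a known standard
deviation and a known true effect, which is a textbook simplification chosen
for illustration and not one of the tests applied in this work; the function
name z_test_power and the numeric values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test for a shift in the mean.

    effect: true shift of the mean (assumed known for the illustration)
    sigma:  population standard deviation (assumed known)
    n:      sample size
    alpha:  probability of Type I error (the significance level)
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1.0 - alpha)   # rejection threshold in z units
    shift = effect * sqrt(n) / sigma             # displacement of the test statistic
    beta = std_normal.cdf(critical - shift)      # probability of Type II error
    return 1.0 - beta                            # power is one minus beta


# With alpha held fixed, a larger sample lowers beta and so raises the power.
for n in (8, 16, 32, 64):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

Under these assumptions the trade-off in the text is visible directly: alpha is
chosen by the experimenter, while beta can be driven down only by a larger
effect or a larger (and more expensive) sample.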

When applying a statistical test to a set of data, any assumptions that the
test requires must be verified. For example, assumptions regarding the
distribution of values or their variance commonly occur in parametric
statistics. Given that a set of data meets the required assumptions, the
determination of the outcome of the test is just mathematics; criticizing this
aspect of experimentation is unfounded. Note that different statistical tests
have their own characteristics, such as in terms of the power or sensitivity of
the test [Siegel 55, pp. 10-11]. 18 Given that the assumptions for two
different tests are both met, one of the tests may be more appropriate to be
chosen on these or other grounds.

16 The prefix pseudo is used here since the basis set achieved is usually an
approximation of a true basis set.

17 Type I error is rejecting the experimental hypothesis when it is indeed
true. The probability of Type I error is the significance level, usually called
alpha. Type II error is not rejecting the experimental hypothesis when it is
false. The probability of Type II error is usually called beta.

18 The power of a test is one minus the probability of Type II error.

6.2.  Appendix  B.  Programs  Used  in  the  Testing  Strategy  Comparison

6.2.1.  Appendix  B.1.  The  Specifications  for  the  Programs

Program  1

Given an input text of up to 80 characters consisting of words separated by
blanks or new-line characters, the program formats it into a line-by-line form
such that 1) each output line has a maximum of 30 characters, 2) a word in the
input text is placed on a single output line, and 3) each output line is filled
with as many words as possible.


The input text is a stream of characters, where the characters are
categorized as either break or nonbreak characters. A break character is a
blank, a new-line character (&), or an end-of-text character (/). New-line
characters have no special significance; they are treated as blanks by the
program. The characters & and / should not appear in the output.

A word is defined as a nonempty sequence of nonbreak characters. A break
is a sequence of one or more break characters and is reduced to a single blank
character or start of a new line in the output.

When the program is invoked, the user types the input line, followed by a
/ (end-of-text) and a carriage return. The program then echoes the text input
and formats it on the terminal.

If the input text contains a word that is too long to fit on a single output
line, an error message is typed and the program terminates. If the end-of-text
character is missing, an error message is issued and the program awaits the
input of a properly terminated line of text.
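
As a reading aid, the formatting rules of this specification can be sketched in
Python. This is an illustrative reference under stated assumptions, not the
program used in the study (whose source appears in Appendix B.2): the function
name format_text is hypothetical, the echo of the input is omitted, and the two
error conditions are signalled with exceptions rather than with terminal
messages and re-prompting.

```python
def format_text(text, width=30, end_of_text="/", newline="&"):
    """Return the formatted output lines for one input text.

    Raises ValueError when the end-of-text character is missing or when a
    word does not fit on a single output line.
    """
    if end_of_text not in text:
        raise ValueError("no end-of-text mark")
    body = text.split(end_of_text, 1)[0]         # keep only the text before '/'
    words = body.replace(newline, " ").split()   # '&' is treated as a blank

    lines, current = [], ""
    for word in words:
        if len(word) > width:
            raise ValueError("word too long")
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= width:
            current += " " + word                # a break shrinks to one blank
        else:
            lines.append(current)                # line is as full as possible
            current = word
    if current:
        lines.append(current)
    return lines


lines = format_text("the quick brown fox&jumps   over the lazy dog/")
```

The greedy rule (append the next word whenever it still fits) is what makes
each output line hold as many words as possible.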

Program  2 

Given ordered pairs (x,y) of either positive or negative integers as input,
the program plots them on a grid with a horizontal x-axis and a vertical y-axis
which are appropriately labeled. A plotted point on the grid should appear as
an asterisk (*).

The vertical and horizontal scaling is handled as follows. If the maximum
absolute value of any y-value is less than or equal to twenty (20), the scale
for vertical spacing will be one line per integral unit (e.g., the point (3,6)
should be plotted on the sixth line; two lines above the point (3,4)). Note
that the origin (point (0,0)) would correspond to an asterisk at the
intersection of the axes (the x-axis is referred to as the 0th line). If the
maximum absolute value of any x-value is less than or equal to thirty (30), the
scale for horizontal spacing will be one space per integral unit (e.g., the
point (4,5) should be plotted four spaces to the right of the y-axis; two
spaces to the right of (2,5)). However, if the maximum absolute value of any
y-value is greater than twenty (20), the scale for vertical spacing will be one
line per every (max abs of yval)/20 rounded-up. (E.g., if the maximum absolute
value of any y-value to be plotted is 66, the vertical line spacing will be a
line for every four (4) integral units. In such a data set, points with
y-values greater than or equal to eight and less than twelve will show up as
asterisks in the second line, points with y-values greater than or equal to
twelve and less than sixteen will show up as asterisks in the third line, etc.
Continuing the example, the point (3,15) should be plotted on the third line;
two lines above the point (3,5).) Horizontal scaling is handled analogously.

If two or more of the points to be plotted would show up as the same
asterisk in the grid (like the points (9,13) and (9,15) in the above example),
a number '2' (or whatever number is appropriate) should be printed instead of
the asterisk. Points whose asterisks will lie on an axis or grid marker should
show up in place of the marker.
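
The scaling arithmetic of this specification can be checked with a short Python
sketch. It is an illustration only and covers non-negative coordinates, since
the specification's examples do not show how negative values map to grid cells;
the function names scale, line_of, and cell_marks are hypothetical.

```python
from collections import Counter


def scale(max_abs, limit):
    """Integral units per line (or per space).

    One unit when every value fits within `limit`; otherwise max_abs / limit
    rounded up, as the specification requires.
    """
    if max_abs <= limit:
        return 1
    return -(-max_abs // limit)        # ceiling division on integers


def line_of(value, units):
    """Grid line (or column) of a non-negative value; the axis is line 0."""
    return value // units


def cell_marks(points, v_units, h_units):
    """Map each occupied grid cell to '*' or to the count of coinciding points."""
    counts = Counter((line_of(x, h_units), line_of(y, v_units)) for x, y in points)
    return {cell: "*" if n == 1 else str(n) for cell, n in counts.items()}


v_units = scale(66, 20)                # the specification's example: 4 units per line
marks = cell_marks([(9, 13), (9, 15), (3, 5)], v_units, h_units=1)
```

With these definitions the specification's example can be replayed: a maximum
absolute y-value of 66 gives four units per line, and the points (9,13) and
(9,15) fall into the same cell.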




Program 3

A list is defined to be an ordered collection of integer elements which may
have elements annexed and deleted at either end, but not in the middle. The
operations that need to be available are ADDFIRST, ADDLAST, DELETEFIRST,
DELETELAST, FIRST, ISEMPTY, LISTLENGTH, REVERSE, and NEWLIST. Each operation is
described in detail below. The lists are to contain up to a maximum of five (5)
elements. If an element is added to the front of a "full" list (one containing
five elements already), the element at the back of the list is to be discarded.
Elements to be added to the back of a full list are discarded. Requests to
delete elements from empty lists result in an empty list, and requests for the
first element of an empty list result in zero (0) being returned. The detailed
operation descriptions are as below:

ADDFIRST(LIST L, INTEGER I)

Returns the list L with I as its first element followed by all the elements of
L. If L is "full" to begin with, L's last element is lost.

ADDLAST(LIST L, INTEGER I)

Returns the list with all of the elements of L followed by I. If L is full to
begin with, L is returned (i.e., I is ignored).

DELETEFIRST(LIST L)

Returns the list containing all but the first element of L. If L is empty,
then an empty list is returned.

DELETELAST(LIST L)

Returns the list containing all but the last element of L. If L is empty,
then an empty list is returned.

FIRST(LIST L)

Returns the first element in L. If L is empty, then it returns zero (0).

ISEMPTY(LIST L)

Returns one (1) if L is empty, zero (0) otherwise.

LISTLENGTH(LIST L)

Returns the number of elements in L. An empty list has zero (0) elements.

NEWLIST(LIST L)

Returns an empty list.




REVERSE(LIST  L) 

Returns a list containing the elements of L in reverse order.
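
The nine operations specified above can be summarized in a compact Python
sketch. It is an illustrative model of the specification only, not the FORTRAN
program used in the study: lists are modeled as ordinary Python lists, the
lower-case function names are hypothetical, and each operation returns a new
list instead of updating a shared storage pool.

```python
MAX_LEN = 5   # the specification allows at most five elements


def addfirst(lst, i):
    """Put i in front; a full list loses its last element."""
    return ([i] + lst)[:MAX_LEN]


def addlast(lst, i):
    """Append i unless the list is already full, in which case i is ignored."""
    return lst + [i] if len(lst) < MAX_LEN else list(lst)


def deletefirst(lst):
    """All but the first element; an empty list stays empty."""
    return lst[1:]


def deletelast(lst):
    """All but the last element; an empty list stays empty."""
    return lst[:-1]


def first(lst):
    """First element, or zero for an empty list."""
    return lst[0] if lst else 0


def isempty(lst):
    """One if the list is empty, zero otherwise."""
    return 1 if not lst else 0


def listlength(lst):
    """Number of elements in the list."""
    return len(lst)


def newlist(lst=None):
    """An empty list, whatever list is passed in."""
    return []


def reverse(lst):
    """The elements in reverse order."""
    return lst[::-1]
```

A model of this kind can serve as an oracle when deriving functional test
cases, for example the boundary cases of adding to a full list or deleting from
an empty one.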

Program  4 

(Note that a 'file' is the same thing as an IBM 'dataset'.)

The  program  maintains  a  database  of  bibliographic  references.  It  first 
reads  a  master  file  of  current  references,  then  reads  a  file  of  reference  updates, 
merges  the  two,  and  produces  an  updated  master  file  and  a  cross  reference  table 
of  keywords. 

The first input file, the master, contains records of 74 characters with the
following format:

column     comment

1 - 3      each reference has a unique reference key
4 - 14     author of publication
15 - 72    title of publication
73 - 74    year issued

The key should be a three (3) character unique identifier consisting of letters
between A-Z. The next input file, the update file, contains records of 75
characters in length. The only difference from a master file record is that an
update record has either an 'A' (capital A meaning add) or an 'R' (capital R
meaning replace) in column 75. Both the master and update files are expected to
be already sorted alphabetically by reference key when read into the program.
Update records with action replace are substituted for the matching key record
in the master file. Records with action add are added to the master file at the
appropriate location so that the file remains sorted on the key field. For
example, a valid update record to be read would be (including a numbered line
just for reference)

123456789012345678901234567890123456789012345678901234567890123456789012345
BITbaker      an introduction to program testing                        83A
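
The fixed-column layout of an update record can be illustrated with a short
Python sketch. It is a minimal illustration of the column positions given
above, not part of the program under test; the function name
parse_update_record and the dictionary keys are hypothetical.

```python
def parse_update_record(record):
    """Split a 75-character update record into its fields.

    The specification numbers columns from 1; Python slices start at 0.
    """
    if len(record) != 75:
        raise ValueError("update records are 75 characters long")
    return {
        "key": record[0:3],                # columns 1-3
        "author": record[3:14].rstrip(),   # columns 4-14
        "title": record[14:72].rstrip(),   # columns 15-72
        "year": record[72:74],             # columns 73-74
        "action": record[74],              # column 75: 'A' (add) or 'R' (replace)
    }


# The example record, rebuilt field by field so that the padding is explicit.
example = (
    "BIT"
    + "baker".ljust(11)
    + "an introduction to program testing".ljust(58)
    + "83"
    + "A"
)
fields = parse_update_record(example)
```

A master record has the same layout without column 75, so the same slices apply
to its first 74 characters.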

The program should produce two pieces of output. It should first print the
sorted list of records in the updated master file in the same format as the
original master file. It should then print a keyword cross reference list. All
words greater than three characters in a publication's title are keywords.
These keywords are listed alphabetically followed by the key fields from the
applicable updated master file entries. For example, if the updated master file
contained two records,

ABCkermit     introduction to software testing                          82

DDXjones      the realities of software management                      81

then the keywords are introduction, testing, realities, software, and
management. The cross reference list should look like

introduction
     ABC
management
     DDX
realities
     DDX
software
     ABC
     DDX
testing
     ABC

Some possible error conditions that could arise and the subsequent actions
include the following. The master and update files should be checked for
sequence, and if a record out of sequence is found, a message similar to 'key
ABC out of sequence' should appear and the record should be discarded. If an
update record indicates replace and the matching key can not be found, a
message similar to 'update key ABC not found' should appear and the update
record should be ignored. If an update record indicates add and a matching key
is found, something like 'key ABC already in file' should appear and the record
should be ignored. (End of specification.)
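
The update rules and the keyword cross reference of this specification can be
sketched in Python as follows. This is a simplified illustration under stated
assumptions, not the FORTRAN program used in the study: records are reduced to
(key, title) pairs, the sequence check on the input files is omitted, and the
function names apply_updates and cross_reference are hypothetical.

```python
def apply_updates(master, updates):
    """Merge update records into the master entries.

    master:  list of (key, title) pairs
    updates: list of (key, title, action) triples, action 'A' or 'R'
    Returns the updated entries sorted by key and the error messages.
    """
    merged = dict(master)
    messages = []
    for key, title, action in updates:
        if action == "R":
            if key in merged:
                merged[key] = title                    # replace matching entry
            else:
                messages.append(f"update key {key} not found")
        elif action == "A":
            if key in merged:
                messages.append(f"key {key} already in file")
            else:
                merged[key] = title                    # add new entry
    return sorted(merged.items()), messages


def cross_reference(entries):
    """Map each keyword (a title word longer than three characters) to its keys."""
    index = {}
    for key, title in entries:
        for word in title.split():
            if len(word) > 3:
                index.setdefault(word, set()).add(key)
    return {word: sorted(keys) for word, keys in sorted(index.items())}


entries = [
    ("ABC", "introduction to software testing"),
    ("DDX", "the realities of software management"),
]
xref = cross_reference(entries)
```

Applied to the two example records, cross_reference yields the keywords in
alphabetical order, each followed by the keys of the entries that contain it.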


6.2.2.  Appendix  B.2.  The  Source  Code  for  the  Programs 


Program  1 

C     NOTE THAT YOU DO NOT NEED TO VERIFY THE FUNCTION 'MATCH'
C     IT IS DESCRIBED THE FIRST TIME IT IS USED, AND ITS SOURCE CODE
C     IS INCLUDED AT THE END FOR COMPLETENESS
C
C     NOTE THAT FORMAT STATEMENTS FOR WRITE STATEMENTS INCLUDE A LEADING
C     AND REQUIRED ' ' FOR CARRIAGE CONTROL

C     VARIABLE USED IN FIRST, BUT NEEDS TO BE INITIALIZED
      INTEGER MOREIN

C     STORAGE USED BY GCHAR
      INTEGER BCOUNT
      CHARACTER*1 GBUFER(80)
      CHARACTER*80 GBUF
C     GBUFER AND GBUFSTR ARE EQUIVALENCED

C     STORAGE USED BY PCHAR
      INTEGER I
      CHARACTER*1 OUTLIN(31)
C     OUTLIN AND OUTLINST ARE EQUIVALENCED

      CHARACTER*1 GCHAR

C     CONSTANT USED THROUGHOUT THE PROGRAM
      CHARACTER*1 EOTEXT, BLANK, LINEFD
      INTEGER MAXPOS

      COMMON /ALL/ MOREIN, BCOUNT, I, MAXPOS, OUTLIN,
     X             EOTEXT, BLANK, LINEFD, GBUFER, GBUF

      DATA EOTEXT, BLANK, LINEFD, MAXPOS / '/', ' ', '&', 31 /


      CALL FIRST
      END


      SUBROUTINE FIRST
      INTEGER K, FILL, BUFPOS
      CHARACTER*1 CW
      CHARACTER*1 BUFFER(31)

      INTEGER MOREIN, BCOUNT, I, MAXPOS
      CHARACTER*1 OUTLIN(31), GCHAR, EOTEXT, BLANK, LINEFD,
     X            GBUFER(80)
      CHARACTER*80 GBUF

      COMMON /ALL/ MOREIN, BCOUNT, I, MAXPOS, OUTLIN,
     X             EOTEXT, BLANK, LINEFD, GBUFER, GBUF

      BUFPOS = 0
      FILL = 0
      CW = ' '

      MOREIN = 1

      I = 1
      K = 1
      DOWHILE (K .LE. MAXPOS)
         OUTLIN(K) = ' '
         K = K + 1
      ENDDO

      BCOUNT = 1

      DOWHILE (K .LE. 80)
         GBUFER(K) = 'Z'
         K = K + 1
      ENDDO

      DOWHILE (MOREIN)
         CW = GCHAR()
         IF ((CW .EQ. BLANK) .OR. (CW .EQ. LINEFD) .OR.
     X       (CW .EQ. EOTEXT)) THEN
            IF (CW .EQ. EOTEXT) THEN
               MOREIN = 0
            END IF
            IF ((FILL+1+BUFPOS) .LE. MAXPOS) THEN
               CALL PCHAR(BLANK)
               FILL = FILL + 1
            ELSE
               CALL PCHAR(LINEFD)
               FILL = 0
            END IF
            K = 1
            DOWHILE (K .LE. BUFPOS)
               CALL PCHAR(BUFFER(K))
               K = K + 1
            ENDDO
            FILL = FILL + BUFPOS
            BUFPOS = 0
         ELSE
            IF (BUFPOS .EQ. MAXPOS) THEN
               WRITE(6,10)
   10          FORMAT(' ','***WORD TO LONG***')
               MOREIN = 1
            ELSE
               BUFPOS = BUFPOS + 1
               BUFFER(BUFPOS) = CW
            END IF
         END IF
      ENDDO

      CALL PCHAR(LINEFD)

      END


      CHARACTER*1 FUNCTION GCHAR()
      INTEGER MATCH
      CHARACTER*80 GBUFSTR

      INTEGER MOREIN, BCOUNT, I, MAXPOS
      CHARACTER*1 OUTLIN(31), EOTEXT, BLANK, LINEFD,
     X            GBUFER(80)
      CHARACTER*80 GBUF

      COMMON /ALL/ MOREIN, BCOUNT, I, MAXPOS, OUTLIN,
     X             EOTEXT, BLANK, LINEFD, GBUFER, GBUF

      EQUIVALENCE (GBUFSTR,GBUFER)

      IF (GBUFER(1) .EQ. 'Z') THEN
         READ(5,20) GBUF
   20    FORMAT(A80)
C
C     MATCH(CARRAY,C) RETURNS 1 IF CHARACTER C IS IN CHARACTER ARRAY
C     CARRAY, RETURNS 0 OTHERWISE. ARSIZE IS THE SIZE OF CARRAY.
C
         IF (MATCH(GBUF,EOTEXT) .EQ. 0) THEN
            WRITE(6,30)
   30       FORMAT(' ','***NO END OF TEXT MARK***')
            GBUFER(2) = EOTEXT
         ELSE
            GBUFER(1) = GBUF
            GBUFSTR = GBUF
         END IF
      END IF

      GCHAR = GBUFER(BCOUNT)
      BCOUNT = BCOUNT + 1
      END


      SUBROUTINE PCHAR (C)
      CHARACTER*1 C
      CHARACTER*31 SOUT, OUTLINST
      INTEGER K

      INTEGER MOREIN, BCOUNT, I, MAXPOS
      CHARACTER*1 OUTLIN(31), GCHAR, EOTEXT, BLANK, LINEFD,
     X            GBUFER(80)
      CHARACTER*80 GBUF

      COMMON /ALL/ MOREIN, BCOUNT, I, MAXPOS, OUTLIN,
     X             EOTEXT, BLANK, LINEFD, GBUFER, GBUF

      EQUIVALENCE (OUTLINST,OUTLIN)

      IF (C .EQ. LINEFD) THEN
         SOUT = OUTLINST
         WRITE(6,40) SOUT
   40    FORMAT(' ',A31)
         K = 1
         DOWHILE (K .LE. MAXPOS)
            OUTLIN(K) = ' '
            K = K + 1
         ENDDO
         I = 1
      ELSE
         OUTLIN(I) = C
         I = I + 1
      ENDIF
      END


Program  2 


INT WIDTH = 30,
    HEIGHT = 20,
    GRIDWD = 61,
    LARGENUM = 100000000
STRING TICKS [61] =
*|.  -|-  -|-  -|-  -I*  -I*  -I-  -I-  ‘I*  'I'  ‘I*  'I*  ‘I’


PROC SORT (INT ARRAY KEYBUF, INT ARRAY FREEBUF, INT N)

  INT I, MAXP
  INT ARRAY SRTKEYB(100), SRTFREEB(100)

  I := 0
  WHILE I < N DO
    SRTKEYB(I) := KEYBUF(I)
    SRTFREEB(I) := FREEBUF(I)
    I := I + 1
  END

  I := N
  WHILE I > 0 DO
    MAXP := MAXELE(SRTKEYB,I)
    KEYBUF(N-I) := SRTKEYB(MAXP)
    FREEBUF(N-I) := SRTFREEB(MAXP)
    CALL REMOVE(SRTKEYB,MAXP,I)
    CALL REMOVE(SRTFREEB,MAXP,I)
    I := I - 1
  END



INT FUNC MAXELE (INT ARRAY BUF, INT N)

  INT I, MAXPTR, MAX

  MAXPTR := -1
  MAX := -LARGENUM
  I := 0
  WHILE I < N DO
    IF BUF(I) > MAX
    THEN
      MAX := BUF(I)
      MAXPTR := I
    END
    I := I + 1
  END
  RETURN(MAXPTR)



INT FUNC MINELE (INT ARRAY BUF, INT N)

  INT I, MINPTR, MIN

  MINPTR := -1
  MIN := LARGENUM
  I := 0
  WHILE I < N DO
    IF BUF(I) < MIN
    THEN
      MIN := BUF(I)
      MINPTR := I
    END
    I := I + 1
  END
  RETURN(MINPTR)



PROC REMOVE (INT ARRAY BUF, INT PTR, INT N)

  INT I

  I := PTR
  WHILE I < N-1 DO
    BUF(I) := BUF(I+1)
    I := I + 1
  END



INT FUNC ABS (INT VAL)

  IF VAL < 0
  THEN
    RETURN(-VAL)
  ELSE
    RETURN(VAL)
  END



INT FUNC SLASH (INT TOP, INT BOT)

  INT RES

  RES := TOP/BOT
  IF TOP <> RES*BOT .AND.
     (TOP > 0 .AND. BOT > 0 .OR. TOP < 0 .AND. BOT < 0)
  THEN RES := RES + 1
  END
  RETURN(RES)

INT FUNC MOD (INT N, INT M)

  INT VAL

  VAL := N-N/M*M
  IF VAL < 0
  THEN
    VAL := VAL + M
  END
  RETURN(VAL)



PROC MAIN

  CHAR ARRAY GRID(61)
  STRING STR[61]
  INT ARRAY XVAL(100), YVAL(100)
  INT I, J, NUMOBS, MAXY, MAXX, MINX, HORISP, VERTSP, VLINE

  I := 0
  WHILE .NOT. EOI DO
    READ(XVAL(I),YVAL(I))
    I := I + 1
  END
  NUMOBS := I

  CALL SORT(YVAL,XVAL,NUMOBS)
  MAXY := YVAL(0)
  VERTSP := SLASH(MAXY,HEIGHT)

  MAXX := XVAL(MAXELE(XVAL,NUMOBS))
  MINX := XVAL(MINELE(XVAL,NUMOBS))
  IF ABS(MINX) > ABS(MAXX)
  THEN
    HORISP := SLASH(ABS(MINX),WIDTH)
  ELSE
    HORISP := SLASH(ABS(MAXX),WIDTH)
  END

  STR := ' X AXIS'
  WRITE(STR,SKIP)
  I := 0
  VLINE := HEIGHT
  WHILE VLINE > 0 DO

    J := 0
    IF MOD(VLINE,5) = 0
    THEN
      UNPACK(TICKS,GRID)
    ELSE
      WHILE J < GRIDWD DO
        GRID(J) := " "
        J := J + 1
      END
    END

    VLINE := VLINE - 1


    WHILE VLINE*VERTSP <= YVAL(I) DO
      IF XVAL(I) >= 0
      THEN
        GRID(WIDTH + SLASH(XVAL(I),HORISP)) :=
      ELSE
        GRID(WIDTH - SLASH(-XVAL(I),HORISP)) :=

      END
      I := I + 1
    END

    GRID(WIDTH) := "|"
    PACK(GRID,STR)
    WRITE(STR,SKIP)
  END

  STR :=
  UNPACK(STR,GRID)
  WHILE 0 <= YVAL(I) .AND. I <= NUMOBS DO
    IF XVAL(I) >= 0
    THEN
      GRID(WIDTH + SLASH(XVAL(I),HORISP)) :=
    ELSE
      GRID(WIDTH - SLASH(-XVAL(I),HORISP)) :=
    END
    I := I + 1
  END

  PACK(GRID,STR)
  WRITE(STR,SKIP)
  STR := ' Y AXIS'
  WRITE(STR,SKIP)

START MAIN

Program  3 

C     NOTE THAT YOU DO NOT NEED TO VERIFY THE FUNCTIONS DRIVER, GETARG,
C     CHAREQ, CODE, AND PRINT. THEIR SOURCE CODE IS DESCRIBED AND
C     INCLUDED AT THE END FOR COMPLETENESS.
C     NOTE THAT FORMAT STATEMENTS FOR WRITE STATEMENTS INCLUDE A LEADING
C     AND REQUIRED ' ' FOR CARRIAGE CONTROL
C
      INTEGER POOL(7), LSTEND
      INTEGER LISTSZ
C
      COMMON /ALL/ LISTSZ
C
C
      LISTSZ = 5
      CALL DRIVER (POOL, LSTEND)
      STOP
      END
C
C
      FUNCTION ADFRST (POOL, LSTEND, I)
      INTEGER ADFRST
      INTEGER POOL(7), LSTEND, I
      INTEGER LISTSZ
      COMMON /ALL/ LISTSZ
C
      INTEGER A
C
      IF (LSTEND .GT. LISTSZ) THEN
         LSTEND = LISTSZ - 1
      END IF
      LSTEND = LSTEND + 1
      A = LSTEND
      DOWHILE (A .GE. 1)
         POOL(A+1) = POOL(A)
         A = A - 1
      ENDDO
C
      POOL(1) = I
      ADFRST = LSTEND
      RETURN
      END
C
C
      FUNCTION ADLAST (POOL, LSTEND, I)
      INTEGER ADLAST
      INTEGER POOL(7), LSTEND, I
      INTEGER LISTSZ
      COMMON /ALL/ LISTSZ
C
      IF (LSTEND .LE. LISTSZ) THEN
         LSTEND = LSTEND + 1
         POOL(LSTEND) = I
      END IF
      ADLAST = LSTEND
      RETURN
      END
C
C
      FUNCTION DELFST (POOL, LSTEND)
      INTEGER DELFST
      INTEGER POOL(7), LSTEND
      INTEGER LISTSZ
      COMMON /ALL/ LISTSZ
C
      INTEGER A
      IF (LSTEND .GT. 1) THEN
         A = 1
         LSTEND = LSTEND - 1
         DOWHILE (A .LE. LSTEND)
            POOL(A) = POOL(A+1)
            A = A + 1
         ENDDO
      END IF
      DELFST = LSTEND
      RETURN
      END

078:  C 
077:  C 

078:  FUNCTION  DELLST  (LSTEND) 

079:  INTEGER  DELLST 

080:  INTEGER  LSTEND 

081:  C 

082:  IF  (LSTEND  .GE.  1)  THEN 

083:  LSTEND  =  LSTEND  -  1 

084:  END  IF 

085:  DELLST  =  LSTEND 

086:  RETURN 

087:  END 

088:  C 

089:  C 

090:  FUNCTION  FIRST  (POOL,  LSTEND) 

091:  INTEGER  FIRST 

092:  INTEGER  POOL(7),  LSTEND 

093:  C 

094:  IF  (LSTEND  .LE.  1)  THEN 

095:  FIRST  =  0 

096:  ELSE 

097:  FIRST  =  POOL(l) 

098:  END  IF 

099:  RETURN 

100:  END 

101:  C 
102:  C 

103:  FUNCTION  EMPTY  (LSTEND) 

104:  INTEGER  EMPTY 

105:  INTEGER  LSTEND 

106:  C 

107:  IF  (LSTEND  .LE.  1)  THEN 

108:  EMPTY  =  1 

109:  ELSE 

110:  EMPTY  =  0 

111:  END  IF 

112:  RETURN 

113:  END 

114:  C 

115:  C 

116:  FUNCTION  LSTLEN  (LSTEND) 

1 17:  INTEGER  LSTLEN 

118:  INTEGER  LSTEND 

119:  C 

120:  LSTLEN  =  LSTEND  -  1 

121:  RETURN 

122:  END 

123:  C 

124:  C 

125:  FUNCTION  NEWLST  (LSTEND) 

126:  INTEGER  NEWLST 

127:  INTEGER  LSTEND 

128:  C 
129: 


NEWLST  =  0 


RETURN 

END 


130: 

131: 

132:  C 
133:  C 

134:  SUBROUTINE  REVERS  (POOL,  LSTEND) 

135:  INTEGER  POOL(7),  LSTEND 

136:  C 

137:  INTEGER  I,  N 

138:  C 

138:  N  =  LSTEND 

140:  I  =  1 

141:  DOWHILE  (I  .LE.  N) 

142:  POOL(I)  =  POOL(N) 

143:  N  =  N  -  1 

144:  I  =  I  +  1 

145:  ENDDO 

146:  RETURN 

147:  END 


Program 4

001:  C  NOTE THAT YOU DO NOT NEED TO VERIFY THE ROUTINES DRIVER, STREQ, WORDEQ,

002:  C  NXTSTR, ARRCPY, CHARPT, BEFORE, CHAREQ, AND WRDBEF. THEIR SOURCE

003:  C  CODE IS DESCRIBED AND INCLUDED AT THE END FOR COMPLETENESS.

004:  C  NOTE THAT FORMAT STATEMENTS FOR WRITE STATEMENTS INCLUDE A LEADING

005:  C  AND REQUIRED ' ' FOR CARRIAGE CONTROL.

006:  C  THE SFORT LANGUAGE CONSTRUCT '.IF (EXPRESSION)' BEGINS A BLOCKED

007:  C  IF-THEN[-ELSE] STATEMENT, AND IT IS EQUIVALENT TO THE F77

008:  C  'IF (EXPRESSION) THEN'.

009:  C

010:  CALL DRIVER

011:  STOP

012:  END 

013:  C 

014:  C 

015:  SUBROUTINE MAINSB
016:  C
017:  LOGICAL*1 U$KEY(3),U$AUTH(11),U$TITL(58),U$YEAR(2),U$ACTN(1)
018:  LOGICAL*1 M$KEY(3),M$AUTH(11),M$TITL(58),M$YEAR(2)
019:  LOGICAL*1 ZZZ(3), LASTUK(3), LASTMK(3)
020:  LOGICAL*1 STREQ, CHAREQ, BEFORE, CHARPT
021:  INTEGER I
022:  C
023:  LOGICAL*1 WORD(500,12), REFKEY(1000,3)
024:  INTEGER NUMWDS, NUMREF, PTR(500), NEXT(1000)
025:  COMMON /WORDS/ WORD, REFKEY, NUMWDS, NUMREF, PTR, NEXT
026:  C
027:  WRITE(6,290)
028:  290 FORMAT(' UPDATED LIST OF MASTER ENTRIES')
029:  DO 300 I = 1, 3
030:  LASTMK(I) = CHARPT(' ')
031:  LASTUK(I) = CHARPT(' ')
032:  ZZZ(I) = CHARPT('Z')
033:  300 CONTINUE
034:  C
035:  NUMWDS = 0
036:  NUMREF = 0
037:  CALL GETNM(M$KEY,M$AUTH,M$TITL,M$YEAR,LASTMK)
038:  CALL GETNUP(U$KEY,U$AUTH,U$TITL,U$YEAR,U$ACTN,LASTUK)
039:  C
040:  DOWHILE ((.NOT.(STREQ(M$KEY,ZZZ,3))) .OR.
041:  X (.NOT.(STREQ(U$KEY,ZZZ,3))) )
042:  .IF (STREQ(U$KEY,M$KEY,3))
043:  .IF (.NOT.(CHAREQ(U$ACTN(1),'R')))
044:  WRITE(6,100) U$KEY
045:  100 FORMAT(' ','KEY ',3A1,' IS ALREADY IN FILE')
046:  END IF
047:  CALL OUTPUT(U$KEY,U$AUTH,U$TITL,U$YEAR)
048:  CALL DICTUP(U$KEY,U$TITL,58)
049:  CALL GETNM(M$KEY,M$AUTH,M$TITL,M$YEAR,LASTMK)
050:  CALL GETNUP(U$KEY,U$AUTH,U$TITL,U$YEAR,U$ACTN,LASTUK)
051:  END IF
052:  C
053:  .IF (BEFORE(M$KEY,3,U$KEY,3))
054:  CALL OUTPUT(M$KEY,M$AUTH,M$TITL,M$YEAR)
055:  CALL DICTUP(M$KEY,M$TITL,58)
056:  CALL GETNM(M$KEY,M$AUTH,M$TITL,M$YEAR,LASTMK)
057:  END IF
058:  C
059:  .IF (BEFORE(U$KEY,3,M$KEY,3))
060:  .IF (CHAREQ(U$ACTN(1),'R'))
061:  WRITE(6,110) U$KEY
062:  110 FORMAT(' ','UPDATE KEY ',3A1,' NOT FOUND')
063:  END IF
064:  CALL OUTPUT(U$KEY,U$AUTH,U$TITL,U$YEAR)
065:  CALL DICTUP(U$KEY,U$TITL,58)
066:  CALL GETNUP(U$KEY,U$AUTH,U$TITL,U$YEAR,U$ACTN,LASTUK)
067:  END IF
068:  ENDDO
069:  C
070:  CALL SRTWDS
071:  CALL PRTWDS
072:  RETURN
073:  END
074:  C
075:  C
076:  SUBROUTINE GETNM(KEY,AUTH,TITL,YEAR,LASTMK)
077:  LOGICAL*1 KEY(3),AUTH(11),TITL(58),YEAR(2),LASTMK(3)
078:  C
079:  LOGICAL*1 SEQ, INLINE(80)
080:  LOGICAL*1 BEFORE, CHARPT, CHAREQ
081:  LOGICAL*1 GO$M, GO$U
082:  COMMON /DRIV/ GO$M, GO$U
083:  C



085:  DOWHILE  (SEQ) 

086:  .IF  (GO$M) 

087:  C 

088:  C  READ  FROM  THE  MASTER  FILE 
089:  C 

090:  READ(10,200,END=299)  INLINE 

091:  ELSE 

092:  C 

093:  C  SEE  REMARK  ABOUT  THE  CHARACTER  '%'  LATER  IN  THE  ROUTINE. 
094:  C 

095:  INLINE(1)  =  CHARPT('%') 

096:  END  IF 

097:  200  FORMAT(80A1) 

098:  DO  210  I  =  1,  3 

099:  KEY(I)  =  INLINE(I) 

100:  210  CONTINUE 

101:  DO  220  I  =  1,  11 

102:  AUTH(I)  =  INLINE(3+I) 

103:  220  CONTINUE 

104:  DO  230  I  =  1,  58 

105:  TITL(I)  =  INLINE(14+I) 

106:  230  CONTINUE 

107:  DO  240  I  =  1,  2 

108:  YEAR(I)  =  INLINE(72+I) 

109:  240  CONTINUE 

110:  C 

111:  C  A METHOD OF SPECIFYING END-OF-FILE IN A FILE IS TO PUT THE CHARACTER '%'

112:  C  AS THE FIRST CHARACTER ON A LINE. THE DRIVER USES THIS FOR MULTIPLE

113:  C  SETS OF INPUT CASES.

114:  C

115:  .IF ((.NOT.(CHAREQ(KEY(1),'%'))) .AND.

116:  X (BEFORE(KEY,3,LASTMK,3)) )

117:  WRITE(6,250) KEY

118:  250 FORMAT(' ','KEY ',3A1,' OUT OF SEQUENCE')

119:  ELSE

120:  CALL ARRCPY(KEY,LASTMK,3)

121:  SEQ = 0

122:  END IF

123:  .IF (CHAREQ(KEY(1),'%'))

124:  SEQ = 0

125:  DO 270 I = 1, 3

126:  KEY(I) = CHARPT('Z')

127:  270  CONTINUE 

128:  END  IF 

129:  ENDDO 

130:  RETURN 

131:  299  CONTINUE 

132:  GO$M  =  0 

133:  DO 260 I = 1, 3

134:  KEY(I) = CHARPT('Z')

135:  260 CONTINUE
136:  RETURN

137:  END

138:  C
139:  C
140:  SUBROUTINE GETNUP(KEY,AUTH,TITL,YEAR,ACTN,LASTUK)

141:  LOGICAL*1 KEY(3),AUTH(11),TITL(58),YEAR(2),ACTN(1),LASTUK(3)

142:  C 

143:  LOGICAL*1  SEQ,  INLINE(80) 

144:  LOGICAL*1  BEFORE,  CHARPT,  CHAREQ 

145:  LOGICAL*1  GO$M,  GO$U 

146:  COMMON  /DRIV/  GO$M,  GO$U 

147:  C 

148:  SEQ  =  1 

149:  DOWHILE  (SEQ) 

150:  .IF  (GO$U) 

151:  C 

152:  C  READ  FROM  THE  UPDATES  FILE 
153:  C 

154:  READ(11,200,END=299) INLINE

155:  ELSE

156:  C

157:  C  SEE REMARK ABOUT THE CHARACTER '%' LATER IN THE ROUTINE.
158:  C

159:  INLINE(1) = CHARPT('%')

160:  END IF

161:  200 FORMAT(80A1)

162:  DO 210 I = 1, 3

163:  KEY(I) = INLINE(I)

164:  210 CONTINUE

165:  DO 220 I = 1, 11

166:  AUTH(I) = INLINE(3+I)

167:  220  CONTINUE 

168:  DO  230  I  =  1,  58 

169:  TITL(I)  =  INLINE(14+I) 

170:  230  CONTINUE 

171:  DO  240  I  =  1,  2 

172:  YEAR(I)  =  INLINE(72+I) 

173:  240  CONTINUE 

174:  ACTN(1) = INLINE(75)

175:  C

176:  C  A METHOD OF SPECIFYING END-OF-FILE IN A FILE IS TO PUT THE CHARACTER '%'

177:  C  AS THE FIRST CHARACTER ON A LINE. THE DRIVER USES THIS FOR MULTIPLE
178:  C  SETS OF INPUT CASES.

179:  C

180:  .IF ((.NOT.(CHAREQ(KEY(1),'%'))) .AND.

181:  X (BEFORE(KEY,3,LASTUK,3)) )

182:  WRITE(6,250) KEY

183:  250 FORMAT(' ','KEY ',3A1,' OUT OF SEQUENCE')

184:  ELSE

185:  CALL ARRCPY(KEY,LASTUK,3)

186:  SEQ = 0

187:  END IF

188:  .IF (CHAREQ(KEY(1),'%'))


190:  DO  270  I  =  1,  3 

191:  KEY(I)  =  CHARPT('Z') 

192:  270  CONTINUE 

193:  END  IF 

194:  ENDDO 

195:  RETURN 

196:  299  CONTINUE 

197:  GO$U  =  0 

198:  DO  260  I  =  1,  3 

199:  KEY(I)  =  CHARPT('Z') 

200:  260  CONTINUE 
201:  RETURN 

202:  END 

203:  C 
204:  C 

205:  SUBROUTINE OUTPUT(KEY,AUTH,TITL,YEAR)

206:  LOGICAL*1 KEY(3), AUTH(11), TITL(58), YEAR(2)

207:  C

208:  WRITE(6,200) KEY, AUTH, TITL, YEAR

209:  200 FORMAT(' ',3A1,11A1,58A1,2A1)

210:  RETURN 

211:  END 

212:  C 
213:  C 

214:  SUBROUTINE  PRTWDS 

215:  C 

216:  LOGICAL*1 WORD(500,12), REFKEY(1000,3)
217:  INTEGER NUMWDS, NUMREF, PTR(500), NEXT(1000)
218:  COMMON /WORDS/ WORD, REFKEY, NUMWDS, NUMREF, PTR, NEXT
219:  C
220:  C  THE ABOVE GROUP OF DATA STRUCTURES SIMULATES A LINKED LIST.
221:  C  WORD(I,J) IS A KEYWORD - J RANGING FROM 1 TO 12
222:  C  REFKEY(PTR(I),K),K=1,3 IS THE FIRST 3 LETTER KEY THAT HAS AS A
223:  C  KEYWORD WORD(I,J),J=1,12
224:  C  REFKEY(NEXT(PTR(I)),K),K=1,3 IS THE SECOND 3 LETTER KEY THAT HAS
225:  C  AS A KEYWORD WORD(I,J),J=1,12
226:  C  REFKEY(NEXT(NEXT(PTR(I))),K),K=1,3 IS THE THIRD ... ETC.
227:  C  NEXT(J) IS EQUAL TO -1 WHEN THERE ARE NO MORE 3 LETTER KEYS FOR
228:  C  THE PARTICULAR KEYWORD
229:  C
230:  INTEGER I, J
231:  LOGICAL*1 FLAG
232:  C
233:  WRITE(6,200)
234:  200 FORMAT(' KEYWORD REFERENCE LIST')
235:  DO 210 I = 1, NUMWDS
236:  FLAG = 1
237:  WRITE(6,220) (WORD(I,J),J=1,12)
238:  220 FORMAT(' ',12A1)
239:  LAST = PTR(I)
240:  DOWHILE (FLAG)
241:  WRITE(6,230) (REFKEY(LAST,J),J=1,3)

242:  230  FORMATC  Y  ’,3A1) 

243:  LAST  =  NEXT(LAST) 

244:  .IF  (LAST  .EQ.  -1) 

245:  FLAG  =  0 

246:  END  IF 

247:  ENDDO 

248:  210  CONTINUE 
249:  RETURN 

250:  END 

251:  C 
252:  C 

253:  SUBROUTINE  DICTUP(KEY,STR,STRLEN) 

254:  LOGICAL*1 KEY(3), STR(120)
255:  INTEGER STRLEN
256:  C
257:  LOGICAL*1 WDLEFT, FLAG, OKLEN, NEXTWD(120), WORDEQ
258:  INTEGER LPTR, NXTSTR, LEN, LAB, I, K
259:  C
260:  LOGICAL*1 WORD(500,12), REFKEY(1000,3)
261:  INTEGER NUMWDS, NUMREF, PTR(500), NEXT(1000)
262:  COMMON /WORDS/ WORD, REFKEY, NUMWDS, NUMREF, PTR, NEXT
263:  C
264:  C  THE ABOVE GROUP OF DATA STRUCTURES SIMULATES A LINKED LIST.
265:  C  WORD(I,J) IS A KEYWORD - J RANGING FROM 1 TO 12
266:  C  REFKEY(PTR(I),K),K=1,3 IS THE FIRST 3 LETTER KEY THAT HAS AS A
267:  C  KEYWORD WORD(I,J),J=1,12
268:  C  REFKEY(NEXT(PTR(I)),K),K=1,3 IS THE SECOND 3 LETTER KEY THAT HAS
269:  C  AS A KEYWORD WORD(I,J),J=1,12
270:  C  REFKEY(NEXT(NEXT(PTR(I))),K),K=1,3 IS THE THIRD ... ETC.
271:  C  NEXT(J) IS EQUAL TO -1 WHEN THERE ARE NO MORE 3 LETTER KEYS FOR
272:  C  THE PARTICULAR KEYWORD
273:  C 

274:  WDLEFT  =  1 

275:  LPTR  =  1 

276:  C 

277:  DO  WHILE  (WDLEFT) 

278:  FLAG  =  1 

279:  OKLEN  =  1 

280:  LEN  =  NXTSTR(STR,STRLEN,LPTR,NEXTWD,120) 

281:  .IF  (LEN  .EQ.  0) 

282:  WDLEFT  =  0 

283:  END  IF 

284:  C 

285:  .IF  (LEN  .LE.  2) 

286:  OKLEN  =  0 

287:  END  IF 

288:  C 

289:  .IF  (OKLEN) 

290:  I  =  1 

291:  DO WHILE ((I .LE. NUMWDS) .AND. FLAG)
.IF (WORDEQ(NEXTWD,I))
LAB = I
FLAG = 0
END IF
I = I + 1
ENDDO
.IF (FLAG)
NUMWDS = NUMWDS + 1
NUMREF = NUMREF + 1
DO 300 K = 1, 12
WORD(NUMWDS,K) = NEXTWD(K)
300 CONTINUE
PTR(NUMWDS) = NUMREF
DO 310 K = 1, 3
REFKEY(NUMREF,K) = KEY(K)
310 CONTINUE
NEXT(NUMREF) = -1
ELSE
NUMREF = NUMREF + 1
DO 320 K = 1, 3
REFKEY(NUMREF,K) = KEY(K)
320 CONTINUE
NEXT(NUMREF) = PTR(LAB)
PTR(LAB) = NUMREF
END IF
END IF
ENDDO

RETURN 

END 


SUBROUTINE  SRTWDS 
C 

LOGICAL*1 WORD(500,12), REFKEY(1000,3)
INTEGER NUMWDS, NUMREF, PTR(500), NEXT(1000)
COMMON /WORDS/ WORD, REFKEY, NUMWDS, NUMREF, PTR, NEXT
C
C  THE ABOVE GROUP OF DATA STRUCTURES SIMULATES A LINKED LIST.
C  WORD(I,J) IS A KEYWORD - J RANGING FROM 1 TO 12
C  REFKEY(PTR(I),K),K=1,3 IS THE FIRST 3 LETTER KEY THAT HAS AS A
C  KEYWORD WORD(I,J),J=1,12
C  REFKEY(NEXT(PTR(I)),K),K=1,3 IS THE SECOND 3 LETTER KEY THAT HAS
C  AS A KEYWORD WORD(I,J),J=1,12
C  REFKEY(NEXT(NEXT(PTR(I))),K),K=1,3 IS THE THIRD ... ETC.
C  NEXT(J) IS EQUAL TO -1 WHEN THERE ARE NO MORE 3 LETTER KEYS FOR
C  THE PARTICULAR KEYWORD
C
INTEGER I, J, K, LAB, LOWERB, UPPERB
LOGICAL*1 WRDBEF, NEXTWD(12)

C 


343:  UPPERB  =  NUMWDS  -  1 


344:  DO 400 I = 1, UPPERB
345:  LOWERB = I + 1
346:  DO 410 J = LOWERB, NUMWDS
347:  .IF (WRDBEF(J,I))
348:  DO 300 K = 1, 12
349:  NEXTWD(K) = WORD(I,K)
350:  300 CONTINUE
351:  LAB = PTR(I)
352:  DO 310 K = 1, 12
353:  WORD(I,K) = WORD(J,K)
354:  310 CONTINUE
355:  PTR(I) = PTR(J)
356:  DO 320 K = 1, 12
357:  WORD(J,K) = NEXTWD(K)
358:  320 CONTINUE
359:  PTR(J) = LAB
360:  END IF
361:  410 CONTINUE
362:  400 CONTINUE


363:  C 


364:  RETURN 

365:  END 


6.3.  Appendix  C.  Operational  Testing  Procedure  Applied  in  the 
Cleanroom  Study 

This section describes the operational testing process applied to the
projects in the Cleanroom empirical investigation. After consulting the references
[Thayer, Lipow & Nelson 78, Duran & Ntafos 81, Dyer 82a, Dyer 82a, Dyer
82b], the following procedure was adopted to meet the particular
circumstances.


6.3.1.  Test  Data  Selection 

The first step in the test data generation process is to define the operational
profile of the system. An initial attempt to define the operational inputs to
the message system and their serialization requirements resulted in the regular
expression in Figure 37.


Figure 37. Regular expression of logical inputs to the system in a single user
session.

signon

    send
    group_send
    (read, find) (respond, hold, delete)
    reset
    names
    groupquery
    add_user
    remove_user
    authorize_user
    add_group
    remove_group
    invalid

signoff


This then led to a transition diagram of functional paths through the system.
There were distinct transition arcs in the diagram to correspond with distinct
functional states of the system. The system states were described as either
system processing or operating states. A distinction in the processing of data that
is transparent to the user is a system "processing state" (e.g., whether or not a
target's queue is empty). A distinction in the processing of data that a user was
directly responsible for is a system "operating state" (e.g., giving an incorrect
password). The arcs leaving a given node were each assigned a frequency such
that the total of all outgoing arcs from a given node was one. This frequency
assignment was accomplished by polling eleven well-seasoned users of the
University of Maryland VAX 11/780 mailing system. Now that each path
through the system had a (subjective) probability, the schedule of presentation

(Note that since each team could choose its own delivery schedule, the test
generation process needed to be reevaluated for each team.) The graph from Figure
37 has been cut and reconstructed according to the groupings of capabilities
given in Figure 38. Be aware that the probability for any given path through
the system is preserved in such a process. The newly created arcs for the
groups are labeled with the probabilities that a given system input will invoke
any function in a particular group (see bottom of Figure 39). Notice that for
any sample size of less than 200, the expected value of the number of test cases
invoking a "privileged" function, group B_0, would be less than one. Since the
relative input frequency for the "privileged" group of commands is
disproportionate to their importance (you would not be able to build and maintain a
network of subscribers without them), a separate schedule for testing them is
created. Figure 39 shows the two schedules for testing.


Figure 39. Two Testing Schedules for a Sample Team.

                               Function Group
              Release   B_00*   B_0     B_1     B_2     B_3

Schedule I       1       XX
                 2       XX     XX

Schedule II      3       XX     XX              XX
                 4       XX     XX      XX      XX
                 5       XX     XX      XX      XX      XX

Operational
frequency                       .005    .129    .686    .180    1.000

* Note that the functions in B_00 are implicitly tested in
all test cases.

Since Schedule I is a special case, first consider Schedule II with the
function group probabilities at the bottom of the columns. In order to accomplish
the concentration of test cases on the newly released functions, each functional
group is assigned a relative weighting that it should have in the test subset
selections for each of the releases. The weights in each of the columns should
sum to one.

                  Function Group
Release     B_0     B_1     B_2     B_3

   3        1/3     0       1/2     0
   4        1/3     2/3     1/3     0
   5        1/3     1/3     1/6     1

            .005    .129    .686    .180


The weights in a given column are then multiplied by the total associated
frequency for that functional group at the bottom of the column.

                            Function Group
Release     B_0           B_1           B_2           B_3

   3        1/3  .001     0    0.0      1/2  .343     0    0.0
   4        1/3  .002     2/3  .086     1/3  .229     0    0.0
   5        1/3  .002     1/3  .043     1/6  .114     1    .180

            .005          .129          .686          .180


The entries (not the weights) in the above table are the probabilities that an
input will be selected to test a function in a given group on a given release.
Summing the rows horizontally reflects the total distribution across the releases.

                            Function Group
Release     B_0           B_1           B_2           B_3           Total

   3        1/3  .001     0    0.0      1/2  .343     0    0.0      .344
   4        1/3  .002     2/3  .086     1/3  .229     0    0.0      .317
   5        1/3  .002     1/3  .043     1/6  .114     1    .180     .339

            .005          .129          .686          .180          1.000



The above process represents the partitioning of the input frequencies for the
various functional groupings by release.
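
The partitioning arithmetic can be sketched in a few lines of Python. This is only an illustration of the weight-times-frequency step described above, not code from the study; the names `FREQ`, `WEIGHT`, `stratify` and `release_totals` are hypothetical, and exact fractions are used so that the checks on column sums are not disturbed by rounding.

```python
from fractions import Fraction as F

# Operational frequency of each function group (bottom row of Figure 39).
FREQ = {"B_0": F(5, 1000), "B_1": F(129, 1000),
        "B_2": F(686, 1000), "B_3": F(180, 1000)}

# Relative weight of each group in releases 3, 4 and 5 (Schedule II).
WEIGHT = {
    3: {"B_0": F(1, 3), "B_1": F(0),    "B_2": F(1, 2), "B_3": F(0)},
    4: {"B_0": F(1, 3), "B_1": F(2, 3), "B_2": F(1, 3), "B_3": F(0)},
    5: {"B_0": F(1, 3), "B_1": F(1, 3), "B_2": F(1, 6), "B_3": F(1)},
}


def stratify(weight, freq):
    """Return {release: {group: weight * frequency}}."""
    return {rel: {g: w[g] * freq[g] for g in freq} for rel, w in weight.items()}


def release_totals(table):
    """Sum each release's row to get its share of the input distribution."""
    return {rel: sum(row.values()) for rel, row in table.items()}


table = stratify(WEIGHT, FREQ)
totals = release_totals(table)

# The weights in every column sum to one, so the row totals sum to 1.000.
for group in FREQ:
    assert sum(WEIGHT[rel][group] for rel in WEIGHT) == 1
assert sum(totals.values()) == 1

for rel in sorted(totals):
    print(rel, round(float(totals[rel]), 3))
```

The exact row totals can differ from the printed ones in the third decimal place, because the printed B_0 entries (.001, .002, .002) are rounded so that they add up to .005.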

Hopefully at some time we will be able to specify the size of our test
sample from the reliability goals of the project. For the purposes of testing these
projects, our experience has led us to choose a test sample size of 100 cases per
project. If ten of these cases (arbitrarily) will be used in the testing of Schedule
I, ninety will remain for Schedule II. Multiplying ninety by the frequencies in
the right hand column of the above table for Schedule II led to the sample sizes
of 31, 29 and 30 test cases for releases 3, 4 and 5, respectively. The above
process has been undertaken to test the expanding system capabilities, while
concentrating on newly released functions and maintaining the composite input
distribution. Figure 40 summarizes the results from this stratification process. In
testing release 1, only the signon and signoff functions (group B_00) were
available and hence only one test case is needed. The remaining nine test cases are
applied to release 2 to test the group of "privileged" functions B_0. The arcs


Figure 40. Arc Frequency Assignment as a Result of Stratification.

Release     Arc Frequency Assignment for Function Group     #Test
            B_00*    B_0     B_1     B_2     B_3            Cases

   1        1.0      0.0     0.0     0.0     0.0               1
   2        0.0      1.0     0.0     0.0     0.0               9
   3        0.0      .003    0.0     .997    0.0              31
   4        0.0      .006    .271    .723    0.0              29
   5        0.0      .006    .127    .336    .531             30

                                                             100

* Recall that the functions in B_00 are implicitly tested in
all test cases.


After this moderately complex procedure, test data can finally be created.
With respect to the revised arc frequency assignments above, a set of test data
of the appropriate size is randomly generated for each release.
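
As a rough illustration of the last two steps, the sketch below splits the ninety Schedule II cases across releases, converts one row of the stratified table into arc frequencies, and draws a random set of function groups for that release. It is a sketch under stated assumptions, not the procedure used in the study: the text does not say how the sample sizes were rounded, so largest-remainder rounding of the printed row totals is assumed here (it happens to reproduce 31, 29 and 30), and the names `allocate`, `arc_frequencies` and `draw_groups` are hypothetical.

```python
import random


def allocate(total_cases, shares_per_mille):
    """Split total_cases across releases by largest-remainder rounding.

    shares_per_mille maps release -> share in thousandths (integers),
    so all arithmetic stays exact.
    """
    scale = sum(shares_per_mille.values())
    quota = {rel: divmod(total_cases * share, scale)
             for rel, share in shares_per_mille.items()}
    sizes = {rel: whole for rel, (whole, _) in quota.items()}
    leftover = total_cases - sum(sizes.values())
    # Hand the remaining cases to the releases with the largest remainders.
    by_remainder = sorted(quota, key=lambda rel: quota[rel][1], reverse=True)
    for rel in by_remainder[:leftover]:
        sizes[rel] += 1
    return sizes


def arc_frequencies(row):
    """Normalize one release's row so its arc frequencies sum to one."""
    total = sum(row.values())
    return {group: p / total for group, p in row.items()}


def draw_groups(row, n, rng):
    """Randomly pick the function group exercised by each of n test cases."""
    freqs = arc_frequencies(row)
    groups = list(freqs)
    return rng.choices(groups, weights=[freqs[g] for g in groups], k=n)


# Printed row totals for Schedule II, in thousandths (.344, .317, .339).
SHARES = {3: 344, 4: 317, 5: 339}
# Printed release-4 row of the stratified table.
RELEASE_4 = {"B_0": 0.002, "B_1": 0.086, "B_2": 0.229, "B_3": 0.0}

sizes = allocate(90, SHARES)
cases = draw_groups(RELEASE_4, sizes[4], random.Random(1985))
print(sizes, arc_frequencies(RELEASE_4))
```

The seed passed to `random.Random` is arbitrary; fixing it only makes a generated test set repeatable. A full generator would walk the transition diagram arc by arc rather than pick a function group directly.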

6.3.2.  Testing  Process  and  Failure  Observation 

The actual testing process consists of three phases for each test case:
system "state" setup (recall the system processing and operating "states"
described earlier), executing the actual test, and verifying the result of the test.
Since our concern in the reliability analysis is with failure-free execution
intervals, the cpu-time for just the second phase, the actual test case execution, is
included in our calculation.



The projects developed were tested interactively, with each given test case
having one of four possible outcomes. If the system performed to expectations
on the test case, the outcome was a 'success.' If the system's performance did
not meet expectations, the outcome was a 'failure' and was rated according to
severity: 1 - product inoperable, 2 - major function in the product inoperable, 3
- some part of a major function inoperable, or 4 - cosmetic type failure. If the
outcome was a 'failure' but the same failure was observed on an earlier test case
in this release, the outcome was termed a 'duplicate failure.' Finally, if the test
case was not able to be executed because we were unable to create the proper
system "state" (on account of failures in this release), the outcome of the test
case was 'deferred.' Test cases with outcomes of 'failure,' 'duplicate failure,' or
'deferred' were included in the test set of the next release.
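
The four outcomes and the carry-over rule can be written down compactly. The following is a minimal sketch, assuming each observed failure can be given an identifier so that repeats within a release are recognizable; `Outcome`, `CaseResult`, `classify` and `carry_over` are illustrative names, not part of the study's tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    DUPLICATE_FAILURE = "duplicate failure"
    DEFERRED = "deferred"


# Severity ratings as listed in the text.
SEVERITY = {
    1: "product inoperable",
    2: "major function in the product inoperable",
    3: "some part of a major function inoperable",
    4: "cosmetic type failure",
}


@dataclass
class CaseResult:
    case_id: str
    outcome: Outcome
    severity: Optional[int] = None  # 1..4, set only for failures


def classify(case_id: str, setup_ok: bool, met_expectations: bool,
             failure_id: Optional[str], severity: Optional[int],
             seen: Set[str]) -> CaseResult:
    """Assign one of the four outcomes; `seen` holds this release's failures."""
    if not setup_ok:
        return CaseResult(case_id, Outcome.DEFERRED)
    if met_expectations:
        return CaseResult(case_id, Outcome.SUCCESS)
    if failure_id in seen:
        return CaseResult(case_id, Outcome.DUPLICATE_FAILURE, severity)
    seen.add(failure_id)
    return CaseResult(case_id, Outcome.FAILURE, severity)


def carry_over(results: List[CaseResult]) -> List[str]:
    """Test cases that are rerun in the next release (everything but successes)."""
    return [r.case_id for r in results if r.outcome is not Outcome.SUCCESS]


seen: Set[str] = set()
results = [
    classify("t1", True, True, None, None, seen),
    classify("t2", True, False, "F-queue", 2, seen),
    classify("t3", True, False, "F-queue", 2, seen),
    classify("t4", False, False, None, None, seen),
]
print([r.outcome.value for r in results], carry_over(results))
```

Resetting `seen` at the start of each release matches the text's definition of a duplicate as a repeat "in this release".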

6.3.3.  Failure  Counting 

Several software reliability models are based on a product's history of
failure-free execution intervals [Jelinski & Moranda 73, Dyer & Mills 82, Goel
82]. In order to calculate these intervals, a consistent interpretation of what
constitutes a failure must be determined. A method of "sorting" or masking
failures by associated product release, product function or by failure severity
has been recognized [Dyer 81]. This technique enables calculation of reliability
estimates for certain functions within a system, including only those failures
worse than a certain severity, etc. In addition to these options, a more
fundamental set of questions needs to be considered. Such as, whether or not dupli-


Figure 41. Failure Counting Issues.

Always include cpu time in failure-free interval for (unless masked):

    successful non-regression tests
    first occurrence of distinct failures

Never include cpu time for:

    deferred test cases

Options:

A. Include cpu time from regression tests:

    1. Just from successful?

    2. Just from failed?

B. Duplicate failures:

    1. Include duplicate failures observed in the same release?

    2. Include duplicate failures observed in later releases?

C. Execution interval that terminated with end of testing (assuming did
   not end with a failure):

    1. discard?

    2. Include as failure-free execution interval -- treat end of testing as
       a failure?

    3. Include as failure-free execution interval of twice the length --
       treat end of testing as a failure twice as far off?

D. Masking:

    1. by testing schedule?

    2. by product release?

    3. by product function?

    4. by failure severity?
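
To make the effect of these choices concrete, the sketch below computes failure-free cpu-time intervals from an ordered test log under options B and C. It is an illustration only: the log format and the function name `failure_free_intervals` are hypothetical, regression tests (option A) and masking (option D) are not modeled, and one assumption is added that the figure does not settle, namely that an excluded duplicate failure contributes no cpu time at all.

```python
def failure_free_intervals(log, include_duplicates=False, end_of_testing="discard"):
    """Return the list of failure-free cpu-time intervals.

    log: (cpu_seconds, outcome) pairs in execution order, with outcome one of
         "success", "failure", "duplicate", "deferred".
    include_duplicates: option B, count duplicate failures as failures.
    end_of_testing: option C, one of "discard", "include", "double".
    """
    intervals = []
    elapsed = 0.0
    for cpu, outcome in log:
        if outcome == "deferred":
            continue  # cpu time of deferred cases is never included
        if outcome == "duplicate" and not include_duplicates:
            continue  # assumption: an excluded duplicate adds no cpu time
        elapsed += cpu
        if outcome in ("failure", "duplicate"):
            intervals.append(elapsed)  # interval ends at the failure
            elapsed = 0.0
    if elapsed > 0:
        if end_of_testing == "include":
            intervals.append(elapsed)
        elif end_of_testing == "double":
            intervals.append(2 * elapsed)
    return intervals


log = [(2.0, "success"), (3.0, "failure"), (1.0, "deferred"),
       (1.5, "duplicate"), (4.0, "success")]
print(failure_free_intervals(log))
print(failure_free_intervals(log, include_duplicates=True, end_of_testing="double"))
```

Different option settings applied to the same log give different interval histories, which is why a single interpretation has to be fixed before the intervals are passed to a reliability model.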


7.  References 


[Adams  84] 

E.  N.  Adams,  Optimizing  Preventive  Service  of  Software  Products, 
IBM  Journal  of  Research  and  Development  28,  1,  pp.  2-14,  Jan.  84. 

[Albin & Ferreol 82]

J.-L. Albin and R. Ferreol, Collecte et analyse de mesures de logiciel
(Collection and Analysis of Software Data), Technique et Science
Informatiques 1, 4, pp. 297-313, 1982. (Rairo ISSN 0752-4072)

[Bailey & Basili 81]

J. W. Bailey and V. R. Basili, A Meta-Model for Software Development
Resource Expenditures, Proc. Fifth Int. Conf. Software Engr.,
San Diego, CA, pp. 107-118, 1981.


[Bailey 84]

J. W. Bailey, Teaching Ada: A Comparison of Two Approaches,
Dept. Com. Sci., Univ. Maryland, College Park, MD, working paper,
1984.

[Baker  72a] 

F.  T.  Baker,  System  Quality  Through  Structured  Programming, 
AFIPS  Proc.  1972  Fall  Joint  Computer  Conf.  41,  pp.  339-343,  1972. 


[Baker  72b] 

F.  T.  Baker,  Chief  Programmer  Team  Management  of  Production 
Programming,  IBM  Systems  J  11,  1,  pp.  131-149,  1972. 


[Baker 81]

F. T. Baker, Chief Programmer Teams, pp. 249-254 in Tutorial on
Structured Programming: Integrated Practices, ed. V. R. Basili and
F. T. Baker, IEEE, 1981.

[Basili et al. 85]

V. R. Basili, E. E. Katz, N. M. Panlilio-Yap, C. L. Ramsey, and S.
Chang, A Quantitative Characterization and Evaluation of a Software
Development in Ada, (to appear IEEE Computer, September 1985)

[Basili & Turner 78]

V. R. Basili and A. J. Turner, SIMPL-T: A Structured Programming
Language, Paladin House Publishers, Geneva, IL, 1978.


[Basili et al. 77]

V. R. Basili, M. V. Zelkowitz, F. E. McGarry, R. W. Reiter, Jr., W.
F. Truszkowski, and D. L. Weiss, The Software Engineering Laboratory,
Software Eng. Lab., NASA/Goddard Space Flight Center,
Greenbelt, MD, Rep. SEL-77-001, May 1977.

[Basili & Zelkowitz 78]

V. R. Basili and M. V. Zelkowitz, Analyzing Medium-Scale Software
Developments, Proc. Third Int. Conf. Software Engr., Atlanta, GA,
pp. 116-123, May 1978.


[Basili 80]

Victor R. Basili, Tutorial on Models and Metrics for Software
Management and Engineering, IEEE Computer Society, New York,
1980.

[Basili & Freburger 81]

V. R. Basili and K. Freburger, Programming Measurement and Estimation
in the Software Engineering Laboratory, Journal of Systems
and Software 2, pp. 47-57, 1981.

[Basili & Weiss 81]

V. R. Basili and D. M. Weiss, Evaluation of a Software Requirements
Document By Analysis of Change Data, Proc. Fifth Int. Conf.
Software Engr., San Diego, CA, pp. 314-323, March 9-12, 1981.

[Basili & Reiter 81]

V. R. Basili and R. W. Reiter, A Controlled Experiment Quantitatively
Comparing Software Development Approaches, IEEE Trans.
Software Engr. SE-7, May 1981.

[Basili & Doerflinger 83]

V. R. Basili and C. Doerflinger, Monitoring Software Development
Through Dynamic Variables, Proc. COMPSAC, Chicago, IL, 1983.

[Basili, Selby & Phillips 83]

V. R. Basili, R. W. Selby, Jr., and T. Y. Phillips, Metric Analysis
and Data Validation Across FORTRAN Projects, IEEE Trans. Software
Engr. SE-9, 6, pp. 652-663, Nov. 1983.

[Basili & Hutchens 83]

V. R. Basili and D. H. Hutchens, An Empirical Study of a Syntactic
Metric Family, IEEE Trans. Software Engr. SE-9, 6, pp. 664-672, Nov.
1983.


[Basili & Perricone 84]

V. R. Basili and B. T. Perricone, Software Errors and Complexity:
An Empirical Investigation, Communications of the ACM 27, 1, pp.
42-52, Jan. 1984.

[Basili & Selby 84]

V. R. Basili and R. W. Selby, Jr., Data Collection and Analysis in
Software Research and Management, Proceedings of the American
Statistical Association and Biometric Society Joint Statistical Meetings,
Philadelphia, PA, August 13-16, 1984.

[Basili & Ramsey 84]

V. R. Basili and J. R. Ramsey, Structural Coverage of Functional
Testing, Dept. Com. Sci., Univ. Maryland, College Park, Tech. Rep.
TR-1442, Sept. 1984.

[Basili & Weiss 84]

V. R. Basili and D. M. Weiss, A Methodology for Collecting Valid
Software Engineering Data, IEEE Trans. Software Engr. SE-10, 6, pp.
728-738, Nov. 1984.


[Behrens  83] 


C.  A.  Behrens,  Measuring  the  Productivity  of  Computer  Systems  De¬ 
velopment  Activities  with  Function  Points,  IEEE  Trans.  Software 
Engr.  SE-9,  6,  pp.  648-651,  Nov.  1983. 


[Boehm  81] 


B.  W.  Boehm,  Software  Engineering  Economics,  Prentice-Hall,  En¬ 
glewood  Cliffs,  NJ,  1981. 


[Boehm  et  al.  84] 

B.  W.  Boehm,  T.  E.  Gray,  and  T.  Seewaldt,  Prototyping  Versus 
Specifying:  A  Multiproject  Experiment,  IEEE  Trans.  Software  Engr. 
SE-10,  3,  pp.  290-303,  May  1984. 


[Bowen  84] 


J.  Bowen,  Estimation  of  Residual  Faults  and  Testing  Effectiveness, 
Seventh  Minnowbrook  Workshop  on  Software  Performance  Evalua¬ 
tion,  Blue  Mountain  Lake,  NY,  July  24-27,  1984. 


[Box, Hunter, & Hunter 78]

G. E. P. Box, W. G. Hunter, and J. S. Hunter, Statistics for
Experimenters, John Wiley & Sons, New York, 1978.


[Brooks 80]

R. E. Brooks, Studying Programmer Behavior: The Problem of Proper
Methodology, Communications of the ACM 23, 4, pp. 207-213,
1980.

[Brooks 81]

W. D. Brooks, Software Technology Payoff: Some Statistical Evidence,
J. Systems and Software 2, pp. 3-9, 1981.

[Buck 81]

F. O. Buck, Indicators of Quality Inspections, IBM Systems Products
Division, Kingston, NY, Tech. Rep. 21.802, Sept. 1981.

[Cailliau & Rubin 79]

R. Cailliau and F. Rubin, ACM Forum: On a Controlled Experiment
in Program Testing, Communications of the ACM 22, pp. 687-8,
Dec. 1979.

[Card  et  al.  82] 

D.  N.  Card,  F.  E.  McGarry,  J.  Page,  S.  Eslinger,  and  V.  R.  Basili, 
The  Software  Engineering  Laboratory,  Software  Eng.  Lab., 
NASA/Goddard  Space  Flight  Center,  Greenbelt,  MD  Rep.  SEL-81- 
104,  Feb.  1982. 

[Chen  78] 

E.  T.  Chen,  Program  Complexity  and  Programmer  Productivity, 
IEEE  Trans.  Software  Engr.,  pp.  187-194,  May  1978. 

[Church  84] 

V.  Church,  Benchmark  Statistics  for  the  VAX  11/780  and  the  IBM 
4341,  Computer  Sciences  Corporation,  Silver  Spring,  MD,  Internal 
Memo,  1984. 

[Cochran  &  Cox  50] 

W. G. Cochran and G. M. Cox, Experimental Designs, John Wiley &
Sons, New York, 1950.

[Cochran 53]

W. G. Cochran, Sampling Techniques, John Wiley & Sons, Inc.,
1953.

[Currit 83]

P. A. Currit, Cleanroom Certification Model, Proc. Eighth Ann. Software
Engr. Workshop, NASA/GSFC, Greenbelt, MD, Nov. 1983.


[Curtis et al. 79]

B. Curtis, S. B. Sheppard, P. Milliman, M. A. Borst, and T. Love,
Measuring the Psychological Complexity of Software Maintenance
Tasks with the Halstead and McCabe Metrics, IEEE Trans. Software
Engr., pp. 96-104, March 1979.

[Curtis, Sheppard & Milliman 79]

B. Curtis, S. B. Sheppard, and P. M. Milliman, Third Time Charm:
Stronger Replication of the Ability of Software Complexity Metrics
to Predict Programmer Performance, Proc. Fourth Int. Conf. Software
Engr., pp. 356-360, Sept. 1979.

[Curtis  83] 

B.  Curtis,  Cognitive  Science  of  Programming,  Sixth  Minnowbrook 
Workshop  on  Software  Performance  Evaluation,  Blue  Mountain 
Lake,  NY,  July  19-22,  1983. 

[Decker  &  Taylor  82] 

W.  J.  Decker  and  W.  A.  Taylor,  FORTRAN  Static  Source  Code 
Analyzer  Program  (SAP)  User's  Guide  (Revision  1),  Software  Eng. 
Lab.,  NASA/Goddard  Space  Flight  Center,  Greenbelt,  MD,  Rep. 
SEL-78-102,  May  1982. 

[Duran  &  Ntafos  81] 

J.  W.  Duran  and  S.  Ntafos,  A  Report  on  Random  Testing,  Proc. 
Fifth  Int.  Conf.  Software  Engr.,  San  Diego,  CA,  pp.  179-183,  March 
9-12,  1981. 

[Dyer  81] 

M.  Dyer,  Cleanroom  Project  Management  Data,  IBM-FSD  Internal 
Memo  to  H.  D.  Mills,  October  16,  1981. 

[Dyer  82a] 

M.  Dyer,  An  Approach  to  Statistical  Testing  for  Cleanroom  Software 
Development, IBM-FSD Tech. Rep. 86.0002, 1982.

[Dyer  &  Mills  82] 

M.  Dyer  and  H.  D.  Mills,  Developing  Electronic  Systems  with 
Certifiable  Reliability,  Proc.  NATO  Conf.,  Summer,  1982. 

[Dyer  82a] 

M. Dyer, Major System Mode ll (MSMU) Testing, IBM-FSD Internal
Memo to H. D. Mills, May 18, 1982.


[Dyer  82b] 


M.  Dyer,  Top-Down  Random  Testing,  IBM-FSD  Internal  Memo  to 
H.  D.  Mills,  June  21,  1982. 


[Dyer  82c] 

M.  Dyer,  Cleanroom  Software  Development  Method,  IBM  Federal 
Systems  Division,  Bethesda,  MD,  October  14,  1982. 

[Dyer  83] 

M. Dyer, Software Validation in the Cleanroom Development
Method,  IBM-FSD  Tech.  Rep.  88.0003,  August  19,  1983. 


[Elshoff  84] 

J.  L.  Elshoff,  Characteristic  Program  Complexity  Metrics,  Proc. 
Seventh Int. Conf. Software Engr., Orlando, FL, pp. 288-293, 1984.


[Endres  75] 

A. Endres, An Analysis of Errors and their Causes in Systems Programs,
IEEE Trans. Software Engr., pp. 140-149, June 1975.


[Fagan 76]

M. E. Fagan, Design and Code Inspections to Reduce Errors in Program
Development, IBM Sys. J. 15, 3, pp. 182-211, 1976.

[Ferrentino & Mills 77]

A. B. Ferrentino and H. D. Mills, State Machines and Their Semantics
in Software Engineering, Proc. IEEE COMPSAC, 1977.

[Feuer  &  Fowlkes  79] 

A.  R.  Feuer  and  E.  B.  Fowlkes,  Some  Results  from  an  Empirical 
Study  of  Computer  Software,  Proc.  Fourth  Int.  Conf.  Software 
Engr.,  pp.  351-355,  1979. 


[Foster  80] 

K.  A.  Foster,  Error  Sensitive  Test  Cases,  IEEE  Trans.  Software 
Engr. SE-6, 3, pp. 258-284, 1980.

[Gaffney  &  Heller  80] 

J.  E.  Gaffney  and  G.  L.  Heller,  Macro  Variable  Software  Models  for 
Application to Improved Software Development Management, Proc.
Workshop on Quantitative Software Models for Reliability, Complexity
and Cost, IEEE Comput. Society, 1980.


[Gannon & Horning 75]

J. D. Gannon and J. J. Horning, The Impact of Language Design on
the Production of Reliable Software, IEEE Trans. Software Engr. SE-1,
pp. 179-191, 1975.

[Gannon  77] 

J. D. Gannon, An Experimental Evaluation of Data Type Conventions,
Communications of the ACM 20, 8, pp. 584-595, 1977.

[Gannon  et  al.  83] 

J. D. Gannon, E. E. Katz, and V. R. Basili, Characterizing Ada Programs:
Packages, The Measurement of Computer Software Performance,
Los Alamos National Laboratory, Aug. 1983.

[Gloss-Soler  79] 

S.  A.  Gloss-Soler,  The  DACS  Glossary:  A  Bibliography  of  Software 
Engineering Terms, Data & Analysis Center for Software, Griffiss
Air Force Base, NY 13441, Rep. GLOS-1, Oct. 1979.

[Goel  82] 

A.  L.  Goel,  Software  Reliability  and  Estimation  Techniques,  Rome 
Air  Development  Center,  NY,  Rep.  RADC-TR-82-263,  October  1982. 

[Goel  83] 

A.  L.  Goel,  A  Guidebook  for  Software  Reliability  Assessment,  Dept. 
Industrial Engr. and Operations Research, Syracuse Univ., New
York,  Tech.  Rep.  83-11,  April  1983. 

[Goodenough  &  Gerhart  75] 

J.  B.  Goodenough  and  S.  L.  Gerhart,  Toward  a  Theory  of  Test  Data 
Selection,  IEEE  Trans.  Software  Engr.,  pp.  156-173,  June  1975. 

[Gould & Drongowski 74]

J. D. Gould and P. Drongowski, An Exploratory Study of Computer
Program  Debugging,  Human  Factors  16,  3,  pp.  258-277,  1974. 

[Gould  75] 

J. D. Gould, Some Psychological Evidence on How People Debug
Computer  Programs,  International  Journal  of  Man-Machine  Studies 
7,  pp.  151-182,  1975. 

[Halstead  77] 

M. H. Halstead, Elements of Software Science, North Holland, New
York, 1977.


[Hamer & Frewin 82]

P. G. Hamer and G. D. Frewin, M. H. Halstead's Software Science --
A Critical Examination, Proc. Sixth Int. Conf. Software Engr., Tokyo,
Japan, pp. 197-208, Sept. 13-18, 1982.


[Hetzel  76] 

W. C. Hetzel, An Experimental Analysis of Program Verification
Methods, Ph.D. Thesis, Univ. of North Carolina, Chapel Hill, 1976.


[Hoare 69]

C.  A.  R.  Hoare,  An  Axiomatic  Basis  for  Computer  Programming, 
Communications  of  the  ACM  12,  10,  pp.  576-583,  Oct.  1969. 

[Howden 76]

W.  E.  Howden,  Reliability  of  the  Path  Analysis  Testing  Strategy, 
IEEE  Trans.  Software  Engr.  SE-2,  3,  Sept.  1976. 

[Howden  78] 

W.  E.  Howden,  Algebraic  Program  Testing,  Acta  Informatica  10, 
1978. 

[Howden  80] 

W.  E.  Howden,  Functional  Program  Testing,  IEEE  Trans.  Software 
Engr. SE-6, pp. 162-169, Mar. 1980.

[Howden 81]

W. E. Howden, A Survey of Dynamic Analysis Methods, pp. 209-231
in Tutorial: Software Testing & Validation Techniques, 2nd Ed., ed.
E.  Miller  and  W.  E.  Howden,  1981. 

[Hutchens & Basili 83]

D. H. Hutchens and V. R. Basili, System Structure Analysis: Clustering
With Data Bindings, Dept. Com. Sci., Univ. Maryland, College
Park, Tech. Rep. TR-1310, August 1983.


[Hwang 81]

S-S. V. Hwang, An Empirical Study in Functional Testing, Structural
Testing, and Code Reading/Inspection, Dept. Com. Sci., Univ. of
Maryland, College Park, Scholarly Paper 362, Dec. 1981.


[IEEE 83]

IEEE, IEEE Standard Glossary of Software Engineering Terminology,
Rep. IEEE-STD-729-1983, IEEE, 342 E. 47th St., New York, 1983.




[Jelinski & Moranda 73]

Z. Jelinski and P. B. Moranda, Applications of a Probability-Based
Model to a Code Reading Experiment, Proc. IEEE Symposium on
Computer Software Reliability, New York, pp. 78-81, IEEE, 1973.

[Jensen & Wirth 74]

K. Jensen and N. Wirth, PASCAL User Manual and Report, 2nd
Ed., Springer-Verlag, New York, 1974.

[Johnson,  Draper  &  Soloway  83] 

W.  L.  Johnson,  S.  Draper,  and  E.  Soloway,  An  Effective  Bug 
Classification  Scheme  Must  Take  the  Programmer  Into  Account, 
Proc.  Workshop  High-Level  Debugging,  Palo  Alto,  CA,  1983. 

[Kelly  82] 

J. P. J. Kelly, Specification of Fault-Tolerant Multi-Version Software:
Experimental Studies of a Design Diversity Approach, UCLA
Ph.D.  Thesis,  1982. 

[Knight  84] 

J. Knight, A Large Scale Experiment in N-Version Programming,
Proc.  of  the  Ninth  Annual  Software  Engineering  Workshop, 
NASA/GSFC,  Greenbelt,  MD,  Nov.  1984. 

[Linger,  Mills  &  Witt  79] 

R. C. Linger, H. D. Mills, and B. I. Witt, Structured Programming:
Theory and Practice, Addison-Wesley, Reading, MA, 1979.

[McCabe 76]

T.  J.  McCabe,  A  Complexity  Measure,  IEEE  Trans.  Software  Engr. 
SE-2,  4,  pp.  308-320,  Dec.  1976. 

[McMullin & Gannon 80]

P. R. McMullin and J. D. Gannon, Evaluating a Data Abstraction
Testing System Based on Formal Specifications, Dept. Com. Sci.,
Univ. of Maryland, College Park, Tech. Rep. TR-993, Dec. 1980.

[Miara et al. 83]

R. J. Miara, J. A. Musselman, J. A. Navarro, and B. Shneiderman,
Program Indentation and Comprehensibility, Communications of the
ACM 26, 11, pp. 861-867, Nov. 1983.


[Mills  72a] 


H. D. Mills, Mathematical Foundations for Structured Programming,
IBM Report FSL 72-0021, 1972.


[Mills 72b]

H.  D.  Mills,  Chief  Programmer  Teams:  Principles  and  Procedures, 
IBM  Corp.,  Gaithersburg,  MD,  Rep.  FSC  71-6012,  1972. 


[Mills  75] 

H. D. Mills, How to Write Correct Programs and Know It, Int. Conf.
on Reliable Software, Los Angeles, pp. 363-370, 1975.

[Moher  &  Schneider  82] 

T.  Moher  and  G.  M.  Schneider,  Methodology  and  Experimental 
Research in Software Engineering, International Journal of
Man-Machine Studies 16, 1, pp. 65-87, 1982.

[Musa  75] 

J.  D.  Musa,  A  Theory  of  Software  Reliability  and  Its  Application, 
IEEE  Trans.  Software  Engr.  SE-1,  3,  pp.  312-327,  1975. 

[Myers  76] 

G.  J.  Myers,  Software  Reliability:  Principles  &  Practices,  John  Wiley 
&  Sons,  New  York,  1976. 

[Myers  78] 

G. J. Myers, A Controlled Experiment in Program Testing and Code
Walkthroughs/Inspections, Communications of the ACM, pp. 780-788,
Sept. 1978.

[Myers  79] 

G.  J.  Myers,  The  Art  of  Software  Testing,  John  Wiley  &  Sons,  New 
York,  1979. 

[Naur  69] 

P. Naur, Programming by Action Clusters, BIT 9, 3, pp. 250-258,
1969. 

[Ostrand & Weyuker 83]

T. J. Ostrand and E. J. Weyuker, Collecting and Categorizing Software
Error Data in an Industrial Environment, Dept. Com. Sci.,
Courant Inst. Math. Sci., New York Univ., NY, Tech. Rep. 47,
August 1982 (Revised May 1983).



[Panzl  81] 


D.  J.  Panzl,  Experience  with  Automatic  Program  Testing,  Proc. 
NBS  Trends  and  Applications,  Nat.  Bureau  Stds.,  Gaithersburg,  MD, 
pp. 25-28, May 28, 1981.

[Parnas  72a] 

D. L. Parnas, Some Conclusions from an Experiment in Software
Engineering Techniques, AFIPS Proc. 1972 Fall Joint Computer Conf.
41,  pp.  325-329,  1972. 

[Parnas  72b] 

D. L. Parnas, On the Criteria to be Used in Decomposing Systems
into Modules, Communications of the ACM 15, 12, pp. 1053-1058,
1972. 

[Parnas  72c] 

D.  L.  Parnas,  A  Technique  for  Module  Specification  With  Examples, 
Communications  of  the  ACM  15,  May  1972. 

[Ramsey  84] 

J. Ramsey, Structural Coverage of Functional Testing, Seventh
Minnowbrook Workshop on Software Performance Evaluation, Blue
Mountain  Lake,  NY,  July  24-27,  1984. 

[SEL 82]

Annotated  Bibliography  of  Software  Engineering  Laboratory  (SEL) 
Literature, Software Eng. Lab., NASA/Goddard Space Flight
Center, Greenbelt, MD, Rep. SEL-82-006, Nov. 1982.

[Selby  83] 

R.  W.  Selby,  Jr.,  An  Empirical  Study  Comparing  Software  Testing 
Techniques,  Sixth  Minnowbrook  Workshop  on  Software  Performance 
Evaluation, Blue Mountain Lake, NY, July 19-22, 1983.

[Selby  84] 

R. W. Selby, Jr., Evaluating Software Testing Strategies, Proc. of
the Ninth Annual Software Engineering Workshop, NASA/GSFC,
Greenbelt,  MD,  Nov.  1984. 

[Selby, Basili & Baker 85]

R. W. Selby, Jr., V. R. Basili, and F. T. Baker, CLEANROOM Software
Development: An Empirical Evaluation, Dept. Com. Sci., Univ.
Maryland, College Park, Tech. Rep. TR-1415, February 1985.
(submitted to the IEEE Trans. Software Engr.)


[Shankar  82] 

K.  S.  Shankar,  A  Functional  Approach  to  Module  Verification,  IEEE 
Trans.  Software  Engr.  SE-8,  2,  March  1982. 

[Sheil 81]

B. A. Sheil, The Psychological Study of Programming, Computing
Surveys  13,  pp.  101-120,  March  1981. 

[Shen,  Conte  &  Dunsmore  83] 

V. Y. Shen, S. D. Conte, and H. E. Dunsmore, Software Science Revisited:
A Critical Evaluation of the Theory and Its Empirical Support,
IEEE Trans. Software Engr. SE-9, 2, pp. 155-165, March 1983.

[Shneiderman et al. 77]

B. Shneiderman, R. E. Mayer, D. McKay, and P. Heller, Experimental
Investigations of the Utility of Detailed Flowcharts in Programming,
Communications of the ACM 20, 6, pp. 373-381, 1977.


[Siegel  55] 

S.  Siegel,  Nonparametric  Statistics  for  the  Behavioral  Sciences, 
McGraw-Hill,  New  York,  1955. 

[Soloway  et  al.  82] 

E. Soloway, K. Ehrlich, J. Bonar, and J. Greenspan, What Do Novices
Know About Programming?, in Directions in Human-Computer
Interactions, ed. A. Badre and B. Shneiderman, Ablex, Inc., 1982.

[Soloway  83] 

E.  Soloway,  You  Can  Observe  a  Lot  by  Just  Watching  How 
Designers Design, Proc. Eighth Ann. Software Engr. Workshop,
NASA/GSFC,  Greenbelt,  MD,  Nov.  1983. 

[Soloway  &  Ehrlich  84] 

E.  Soloway  and  K.  Ehrlich,  Empirical  Studies  of  Programming 
Knowledge, IEEE Trans. Software Engr. SE-10, 5, pp. 595-609, Sept.
1984. 


[Stucki 77]

L. G. Stucki, New Directions in Automated Tools for Improving
Software Quality, in Current Trends in Programming Methodology,
ed. R. T. Yeh, Prentice Hall, Englewood Cliffs, NJ, 1977.




[Thayer, Lipow & Nelson 78]

R. A. Thayer, M. Lipow, and E. C. Nelson, Software Reliability,
North-Holland, Amsterdam, 1978.

[Valdes & Goel 83]

P. M. Valdes and A. L. Goel, An Error-Specific Approach to Testing,
Proc. Eighth Ann. Software Engr. Workshop, NASA/GSFC,
Greenbelt, MD, Nov. 1983.

[Vessey & Weber 83]

I. Vessey and R. Weber, Some Factors Affecting Program Repair
Maintenance: An Empirical Study, Communications of the ACM 26,
2, pp. 128-134, Feb. 1983.

[Vosburgh et al. 84]

J. Vosburgh, B. Curtis, R. Wolverton, B. Albert, H. Malec, S.
Hoben, and Y. Liu, Productivity Factors and Programming Environments,
Proc. Seventh Int. Conf. Software Engr., Orlando, FL, pp.
113-152, 1984.

[Walston & Felix 77]

C. E. Walston and C. P. Felix, A Method of Programming Measurement
and Estimation, IBM Systems J. 16, 1, pp. 51-73, 1977.

[Weiss & Basili 85]

D. M. Weiss and V. R. Basili, Evaluating Software Development by
Analysis of Changes: Some Data from the Software Engineering
Laboratory, IEEE Trans. Software Engr. SE-11, 2, pp. 157-168,
February 1985.

[Weissman 74]

L. Weissman, Psychological Complexity of Computer Programs: An
Experimental Methodology, SIGPLAN Notices 9, 6, pp. 25-36,
June 1974.

[Woodfield, Dunsmore & Shen 81]

S. N. Woodfield, H. E. Dunsmore, and V. Y. Shen, The Effect of
Modularization and Comments on Program Comprehension, Dept.
Com. Sci., Arizona St. Univ., Tempe, AZ, working paper, 1981.

[Zolnowski & Simmons 81]

J. C. Zolnowski and D. B. Simmons, Taking the Measure of Program
Complexity, Proc. National Computer Conference, pp. 329-336, 1981.


REPORT  DOCUMENTATION  PAGE 


4.  TITLE  (and  Subtitle) 


EVALUATIONS  OF  SOFTWARE  TECHNOLOGIES: 
Testing, CLEANROOM, and Metrics


READ  INSTRUCTIONS 
BEFORE  COMPLETING  FORM 


RECIPIENT'S  CATALOG  NUMBER 


5. TYPE OF REPORT & PERIOD COVERED


Technical  Report 


6. PERFORMING ORG. REPORT NUMBER


7. AUTHOR(s)


Richard  W.  Selby,  Jr. 


8. CONTRACT OR GRANT NUMBER(s)


F  49620-80-C-001 


9. PERFORMING ORGANIZATION NAME AND ADDRESS

Department  of  Computer  Science 
University  of  Maryland 
College  Park,  Maryland  20742 


11. CONTROLLING OFFICE NAME AND ADDRESS

Math. & Info. Sciences, AFOSR
Bolling AFB
Washington, D.C. 20332

13. NUMBER OF PAGES


12. REPORT DATE

May  1985 


14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

15. SECURITY CLASS. (of this report)

UNCLASSIFIED 


15a. DECLASSIFICATION/DOWNGRADING SCHEDULE


16. DISTRIBUTION STATEMENT (of this Report)


Approved  for  public  release;  distribution  unlimited 


17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)


19. KEY WORDS (Continue on reverse side if necessary and identify by block number)


20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

The evaluation of software
technologies suffers because of the lack of quantitative assessment of their
effect  on  software  development  and  modification.  A  seven-step  approach  for 
quantitatively  evaluating  software  technologies  couples  software  methodology 
evaluation  with  software  measurement.  The  approach  is  applied  in-depth  in 
the following three areas. 1) Software Testing Strategies: A 74-subject study,
including  32  professional  programmers  and  42  advanced  university  students, 
compared  code  reading,  functional  testing,  and  structural  testing  in  a 
fractional factorial design. 2) Cleanroom Software Development: Fifteen
three-person teams separately built a 1200-line message system to compare Cleanroom
software development (in which software is developed completely off-line)
with a more traditional approach. 3) Characteristic Software Metric Sets:
In the NASA S.E.L. production environment, a study of 65 candidate product
and  process  measures  of  652  modules  from  six  (51,000  -  112,000  line)  projects 
yielded  a  characteristic  set  of  software  cost/quality  metrics. 

The major results are the following. 1) The approach described for quantitatively
evaluating software technologies has been demonstrated to be effective
in  a  variety  of  problem  domains.  2)  With  the  professionals,  code  reading 
detected  more  software  faults  and  had  a  higher  fault  detection  rate  than  did 
functional  or  structural  testing,  while  functional  testing  detected  more 
faults  than  did  structural  testing,  but  functional  and  structural  testing  were 
not  different  in  fault  detection  rate.  3)  With  the  students,  the  three 
techniques  were  not  different  in  the  number  of  faults  detected  or  in  the 
fault  detection  rate,  except  that  structural  testing  detected  fewer  faults  than 
did  the  others  in  one  study  phase.  4)  Code  reading  detected  more  interface 
faults  and  functional  testing  detected  more  control  faults  than  did  the  other 
methods.  5)  Most  developers  using  the  Cleanroom  software  development  approach 
were able to build systems completely off-line. 6) The Cleanroom teams' products
met system requirements more completely and succeeded on more operational
test cases than did those developed with a traditional approach. 7) An approach
described for calculating a characteristic metric set yielded the set for the
NASA  S.E.L.  environment  (source  lines,  design  effort,  number  of  input/output 
parameters,  fault  correction  effort  per  executable  statement,  code  effort, 
number  of  versions). 


